Language and Vision, Day 4 Lecture 3
TRANSCRIPT
Day 4 Lecture 3
Language and Vision
Xavier Giró-i-Nieto
2
Acknowledgments
Santi Pascual
3
In lecture D2L6 RNNs
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014)
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Ñeco, R.P. and Forcada, M.L. (1997, June). "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks, 1997, Vol. 4, pp. 2535-2540. IEEE
6
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Representation or Embedding
For clarity, let's study a Neural Machine Translation (NMT) case
7
Encoder: One-hot encoding
One-hot encoding: a binary representation of the words in a vocabulary in which only combinations with a single hot (1) bit, all other bits cold (0), are allowed
Word   Binary  One-hot encoding
zero   00      0001
one    01      0010
two    10      0100
three  11      1000
8
Encoder: One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word      One-hot encoding
economic  000010
growth    001000
has       100000
slowed    000001
Encoder: One-hot encoding
One-hot is a very simple representation: every word is equidistant from every other word.
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
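As a quick illustration, here is a minimal numpy sketch of one-hot encoding (the tiny vocabulary and helper names are invented for this example); it also checks the equidistance property just mentioned:

```python
import numpy as np

# Hypothetical 4-word vocabulary (K = 4); names invented for the example.
vocab = ["economic", "growth", "has", "slowed"]
word_to_index = {w: i for i, w in enumerate(vocab)}
K = len(vocab)

def one_hot(word):
    """Return a K-dimensional vector with a single hot (1) bit."""
    v = np.zeros(K)
    v[word_to_index[word]] = 1.0
    return v

vectors = [one_hot(w) for w in vocab]

# Every word is equidistant from every other: the Euclidean distance
# between any two distinct one-hot vectors is always sqrt(2).
d = np.linalg.norm(vectors[0] - vectors[1])
```

Because every pair of distinct one-hot vectors differs in exactly two positions, the distance carries no information about word similarity, which motivates the continuous projection on the next slides.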
10
Encoder: Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The one-hot vector w_i (dimension K) is linearly projected to a space of lower dimension (typically 100-500) with a matrix E of learned weights: s_i = E w_i
11
Encoder: Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Projection matrix E corresponds to a fully connected layer, so its parameters are learned during the training process.
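The projection can be sketched in a few lines of numpy (sizes and weights below are illustrative stand-ins, not trained values):

```python
import numpy as np

np.random.seed(0)
K, M = 6, 3                  # vocabulary size K, embedding dimension M (illustrative)
E = np.random.randn(M, K)    # projection matrix E; learned in practice

w_i = np.zeros(K)
w_i[2] = 1.0                 # one-hot vector for the i-th word

s_i = E @ w_i                # continuous-space representation s_i = E w_i

# Multiplying by a one-hot vector simply selects column 2 of E, which is
# why E is usually implemented as an embedding lookup table.
assert np.allclose(s_i, E[:, 2])
```

The final assert shows the design choice behind embedding layers: the matrix product is never computed explicitly, a row/column lookup suffices.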
12
Encoder: Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
A sequence of words becomes a sequence of continuous-space word representations.
13
Encoder: Recurrence
Sequence
Figure: Christopher Olah, "Understanding LSTM Networks" (2015)
14
Encoder: Recurrence
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
15
Encoder: Recurrence
[Figure: the recurrence unrolled over time, shown in Front View and Side View (rotated 90°).]
16
Encoder: Recurrence (Front View, rotated 90° to Side View)
The final hidden state is the representation or embedding of the sentence.
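The recurrence above can be sketched as a toy tanh RNN (random stand-in weights, illustrative sizes); the final hidden state gives a fixed-size sentence embedding regardless of sentence length:

```python
import numpy as np

np.random.seed(0)
M, H = 3, 5                        # word-vector and hidden-state sizes (illustrative)
W_x = np.random.randn(H, M) * 0.1  # input-to-hidden weights (stand-ins)
W_h = np.random.randn(H, H) * 0.1  # hidden-to-hidden (recurrent) weights

def encode(word_vectors):
    """Run a plain tanh RNN over the sequence; the final hidden
    state serves as the sentence embedding."""
    h = np.zeros(H)
    for x in word_vectors:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

sentence = [np.random.randn(M) for _ in range(4)]
embedding = encode(sentence)       # fixed-size vector, whatever the length
```

A real NMT encoder would use gated units (GRU/LSTM) rather than this vanilla recurrence, but the interface is the same: variable-length sequence in, fixed-size embedding out.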
17
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014
Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013
19
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.
20
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product: e(k) = w_k^T z_i, where z_i is the RNN internal state and w_k holds the neuron weights for word k.
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989
...and finally we normalize to word probabilities with a softmax:

p(w_i = k | w_1, ..., w_{i-1}, h_T) = exp(e(k)) / Σ_j exp(e(j))

Here e(k) is the score for word k, and the left-hand side is the probability that the i-th word is word k, given the previous words and the hidden state.
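The scoring and softmax steps can be sketched together in numpy (vocabulary and state sizes are illustrative, weights are random stand-ins for a trained decoder):

```python
import numpy as np

np.random.seed(0)
V, H = 8, 5                  # vocabulary size and state size (illustrative)
W = np.random.randn(V, H)    # one weight vector w_k per word k
z_i = np.random.randn(H)     # decoder internal state

scores = W @ z_i             # e(k) = w_k . z_i for every word k at once

def softmax(e):
    e = e - e.max()          # subtract the max for numerical stability
    p = np.exp(e)
    return p / p.sum()

probs = softmax(scores)      # p(w_i = k | ...); non-negative, sums to 1
```

Stacking all the w_k into one matrix W turns the per-word dot products into a single matrix-vector product, which is how the output layer is implemented in practice.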
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
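A hypothetical greedy decoding loop illustrates generation until <EOS>; all weights are random stand-ins, index 0 plays the role of <EOS>, and for simplicity the sentence embedding only initializes the state (a real decoder would also condition on it at every step):

```python
import numpy as np

np.random.seed(1)
V, H = 6, 4                            # vocabulary and state sizes (illustrative)
W_out = np.random.randn(V, H)          # output scoring weights
W_z = np.random.randn(H, H) * 0.5      # state-update weights
E_dec = np.random.randn(H, V) * 0.5    # embedding of the previous word

def softmax(e):
    p = np.exp(e - e.max())
    return p / p.sum()

def greedy_decode(h_T, max_len=20):
    """Emit the most probable word at each step, stopping at <EOS> (index 0)."""
    z, prev, out = h_T.copy(), 0, []
    for _ in range(max_len):
        z = np.tanh(W_z @ z + E_dec[:, prev])       # update internal state
        prev = int(np.argmax(softmax(W_out @ z)))   # pick most probable word
        if prev == 0:                               # <EOS> predicted: stop
            break
        out.append(prev)
    return out

words = greedy_decode(np.random.randn(H))
```

Greedy argmax is the simplest strategy; beam search is what NMT systems typically use on top of the same scoring machinery.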
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder: Training
Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
29
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
30
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
31
Captioning: LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016
[Figure: an LSTM unit (2nd layer) over the image sequence, time t = 1 … t = T; the hidden state at t = T encodes the first chunk of data.]
37
Visual Question Answering
Encode: "Is economic growth decreasing?" → [z_1, z_2, …, z_N]
Decode: [y_1, y_2, …, y_M] → "Yes"
38
Pipeline: extract visual features; embed the question; merge both; predict the answer.
Question: "What object is flying?" → Answer: "Kite"
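The merge-and-classify pipeline above can be sketched in numpy; merging by element-wise sum is one common choice, and all sizes and weights below are illustrative stand-ins, not a trained model:

```python
import numpy as np

np.random.seed(0)
D_img, D_q, D_joint, n_answers = 512, 100, 64, 10   # illustrative sizes

W_img = np.random.randn(D_joint, D_img) * 0.01      # visual embedding weights
W_q = np.random.randn(D_joint, D_q) * 0.01          # question embedding weights
W_ans = np.random.randn(n_answers, D_joint) * 0.1   # answer classifier weights

def softmax(e):
    p = np.exp(e - e.max())
    return p / p.sum()

def answer_probs(img_features, question_vector):
    """Merge the two modalities (element-wise sum after projection)
    and predict the answer as a classification over frequent answers."""
    joint = np.tanh(W_img @ img_features + W_q @ question_vector)
    return softmax(W_ans @ joint)

p = answer_probs(np.random.randn(D_img), np.random.randn(D_q))
```

Treating VQA as classification over a fixed set of frequent answers (rather than free-form decoding) is the standard simplification behind baselines like the one cited later in the deck.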
Visual Question Answering
Slide credit: Issey Masuda
39
Visual Question Answering
Noh, H., Seo, P.H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the "q" textual space.
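The reshape from the VGG-19 feature map into 196 region vectors, plus the projection by W, can be written in a few lines of numpy (the feature map is random here, and d_text is an assumed textual-space size):

```python
import numpy as np

# VGG-19 last pooling layer on a 448x448 input yields 512 x 14 x 14
# (random values stand in for real CNN activations).
feature_map = np.random.randn(512, 14, 14)

# Flatten the 14x14 spatial grid into 196 local region vectors of size 512.
regions = feature_map.reshape(512, 196).T        # shape (196, 512)

# Project each region into the textual space with a (learned) matrix W.
d_text = 100                                     # assumed textual-space size
W = np.random.randn(d_text, 512) * 0.01
projected = regions @ W.T                        # shape (196, 100)
```

Each of the 196 rows now plays the role of one "sentence" in the memory network's input, which is exactly the equivalence the slide describes.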
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Visual Question Answering accuracies (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB. [Clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary
Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers: a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how do we evaluate an AI's image understanding?
Slide credit: Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
![Page 2: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/2.jpg)
2
Acknowledgments
Santi Pascual
3
In lecture D2L6 RNNs
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 3: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/3.jpg)
3
In lecture D2L6 RNNs
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 4: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/4.jpg)
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.
...and finally normalize to word probabilities with a softmax: p(w_i = k | w_1, ..., w_{i-1}, z_i) = exp(e(k)) / sum_j exp(e(j)), the probability that the i-th word is word k, given the previous words and the hidden state.
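Putting the two decoder steps together, dot-product scores followed by a softmax, in NumPy (all sizes and weights are toy stand-ins):

```python
import numpy as np

V, d_h = 10, 4                        # vocabulary size, decoder state size (toy)
rng = np.random.default_rng(3)
W_out = rng.standard_normal((V, d_h)) # one weight row w_k per word k

z = rng.standard_normal(d_h)          # decoder internal state z_i
e = W_out @ z                         # scores e(k) = w_k . z_i for every k

p = np.exp(e - e.max())               # softmax (shifted for numerical stability)
p /= p.sum()                          # p[k] = P(w_i = k | previous words, z_i)

assert np.isclose(p.sum(), 1.0)
```

Subtracting `e.max()` before exponentiating does not change the result but avoids overflow for large scores.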
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
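The generation loop can be sketched as greedy decoding. This is a toy version with made-up weights: the state update is a plain tanh recurrence rather than the GRU/LSTM used in practice, and real systems usually replace the argmax with beam search:

```python
import numpy as np

V, d_h, EOS = 10, 4, 0              # toy vocabulary, state size, <EOS> index
rng = np.random.default_rng(4)
W_out = rng.standard_normal((V, d_h))
W_z = rng.standard_normal((d_h, d_h)) * 0.1
U_emb = rng.standard_normal((d_h, V)) * 0.1   # embeds the previous word

def decode(h, max_len=20):
    """Greedily emit words until <EOS> is predicted (h: sentence embedding)."""
    z, prev, out = h.copy(), EOS, []  # <EOS> stands in for a start token here
    for _ in range(max_len):
        z = np.tanh(W_z @ z + U_emb[:, prev])  # z_i from z_{i-1} and u_{i-1}
        k = int(np.argmax(W_out @ z))          # most probable word
        if k == EOS:
            break
        out.append(k)
        prev = k
    return out

words = decode(rng.standard_normal(d_h))
assert len(words) <= 20
```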
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder: Training. Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
The Multimodal Recurrent Neural Network only takes image features into account in the first hidden state.
29
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
30
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
31
Captioning LSTM for image amp video
Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code].
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
36
Captioning HRNE
(Slides by Marc Bolaños) Pan, Pingbo, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: a first-layer LSTM processes the first chunk of image data from t = 1 to t = T; its hidden state at t = T feeds an LSTM unit in the 2nd layer.)
37
Visual Question Answering
(Figure: the inputs are encoded into sequences [z_1, z_2, ..., z_N] and [y_1, y_2, ..., y_M]; the question "Is economic growth decreasing?" maps to the decoded answer "Yes".)
38
Visual Question Answering
(Diagram: the question "What object is flying?" is embedded; visual features are extracted from the image; the two are merged to predict the answer: "Kite".)
Slide credit: Issey Masuda
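The merge step above can be sketched in NumPy. This assumes, as many VQA baselines do, an elementwise product of the projected image features and the question embedding, followed by a softmax over candidate answers; all sizes and weights here are made up:

```python
import numpy as np

d_img, d_q, n_ans = 4096, 300, 1000    # CNN feature, question, answer sizes (toy)
rng = np.random.default_rng(5)
W_v = rng.standard_normal((d_q, d_img)) * 0.01  # image -> question space
W_a = rng.standard_normal((n_ans, d_q)) * 0.01  # merged -> answer scores

v = rng.standard_normal(d_img)         # visual features (e.g. from a CNN)
q = rng.standard_normal(d_q)           # question embedding (e.g. from an RNN)

m = (W_v @ v) * q                      # merge: elementwise product
scores = W_a @ m
p = np.exp(scores - scores.max())
p /= p.sum()                           # distribution over candidate answers
answer = int(np.argmax(p))             # index of the predicted answer

assert 0 <= answer < n_ans
```

Treating VQA as classification over a fixed answer list is a common simplification; generative decoders are also used.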
39
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering Dynamic
(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features to the textual space of the question q.
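The region extraction and projection described above amount to a reshape and a matrix multiply. A sketch with random arrays standing in for real VGG-19 activations and a learned W:

```python
import numpy as np

rng = np.random.default_rng(6)
conv = rng.standard_normal((512, 14, 14))   # last pooling output, D = 512x14x14

regions = conv.reshape(512, 196).T          # 196 local region vectors, 512-d
assert regions.shape == (196, 512)

d_q = 300                                   # textual (question) space size (toy)
W = rng.standard_normal((d_q, 512)) * 0.01  # visual-to-textual projection
projected = regions @ W.T                   # each region now lives in q-space
assert projected.shape == (196, d_q)
```

Each of the 14x14 spatial positions becomes one 512-d "sentence-like" vector that the memory network can attend over.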
42
Visual Question Answering Grounded
(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Accuracy on the VQA challenge (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering": 53.62. Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers. A great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit: Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
![Page 5: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/5.jpg)
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 6: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/6.jpg)
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 7: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/7.jpg)
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Challenges: Visual Question Answering
Results chart (accuracy, %, out of 100.0):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
This work: 53.62
I. Masuda-Mora, “Open-Ended Visual Question-Answering.” Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]
51
Summary:
Embedding language and vision into shared semantic spaces allows fusion and joint learning.
Very high interest among researchers. Great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate an AI's image understanding?
Slide credit: Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
![Page 8: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/8.jpg)
![Page 9: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/9.jpg)
![Page 10: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/10.jpg)
![Page 11: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/11.jpg)
![Page 12: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/12.jpg)
12
Encoder: Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Sequence of continuous-space word representations
Sequence of words
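The projection from the previous slides can be sketched in a few lines of numpy: multiplying the learned matrix E by a one-hot vector simply selects one column of E. The sizes `K` and `d` below are toy values for illustration, not the lecture's.

```python
import numpy as np

K, d = 6, 3                                       # vocabulary size, embedding dim
E = np.arange(K * d, dtype=float).reshape(d, K)   # stand-in for the learned matrix E

w = np.zeros(K)                                   # one-hot encoding of word index 2
w[2] = 1.0

s = E @ w                                         # continuous-space representation s_i

# The matrix-vector product just picks out column 2 of E:
assert np.array_equal(s, E[:, 2])
```

This is why the projection can be implemented as a cheap table lookup instead of a full matrix multiplication.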
13
Encoder: Recurrence
Sequence
Figure: Christopher Olah, "Understanding LSTM Networks" (2015)
14
Encoder: Recurrence
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
15
Encoder: Recurrence
Front View / Side View (rotation 90°)
16
Encoder: Recurrence
Front View / Side View (rotation 90°)
Representation or embedding of the sentence
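The recurrence above, which compresses a sentence of any length into one fixed-size embedding, can be sketched with a plain tanh RNN (a stand-in for the GRU/LSTM cells used in practice; all sizes are toy values):

```python
import numpy as np

def rnn_encode(embedded_words, Wx, Wh, b):
    """Run a simple tanh RNN over word embeddings; the final hidden
    state is the representation (embedding) of the whole sentence."""
    h = np.zeros(Wh.shape[0])
    for x in embedded_words:              # one recurrence step per word
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

rng = np.random.default_rng(0)
d_emb, d_hid = 4, 6                       # toy sizes; real systems use hundreds
Wx = 0.1 * rng.normal(size=(d_hid, d_emb))
Wh = 0.1 * rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

h_T = rnn_encode(rng.normal(size=(5, d_emb)), Wx, Wh, b)   # a 5-word sentence
print(h_T.shape)   # (6,): same size no matter how long the sentence is
```

The point of the sketch: the sentence embedding has a fixed dimensionality regardless of sentence length, which is what lets the decoder consume it.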
17
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." NIPS 2013, pp. 3111–3119.
19
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.
20
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state and the neuron weights for word k.
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.
…and finally normalize to word probabilities with a softmax: the score for word k becomes the probability that the i-th word is word k, conditioned on the previous words and the hidden state.
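The scoring-plus-softmax step can be sketched as a minimal numpy function, under assumed toy shapes; the rows of `W_out` play the role of the per-word neuron weights from the previous slide:

```python
import numpy as np

def word_probabilities(z_i, W_out, b_out):
    """Dot-product score for every word k, then a softmax that turns
    the scores into p(w_i = k | previous words, hidden state)."""
    scores = W_out @ z_i + b_out      # one score e_k per vocabulary word
    scores -= scores.max()            # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(1)
vocab_size, d_hid = 10, 6
W_out = rng.normal(size=(vocab_size, d_hid))  # row k = weights for word k
b_out = np.zeros(vocab_size)
z_i = rng.normal(size=d_hid)                  # current decoder state

p = word_probabilities(z_i, W_out, b_out)
print(bool(np.isclose(p.sum(), 1.0)))         # True: a valid distribution
```

Subtracting the maximum score before exponentiating is the standard trick to avoid overflow; it does not change the resulting probabilities.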
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
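That generation loop can be sketched as greedy decoding: keep feeding the previous word back in until <EOS> comes out. The `toy_step` transition function below is a scripted stand-in for a real decoder step, used only to make the loop runnable:

```python
import numpy as np

EOS = 0  # assumed index of the <EOS> token

def greedy_decode(h_T, step_fn, max_len=20):
    """Generate word indices until <EOS> is predicted (or max_len is hit).
    step_fn(z, prev_word, h_T) -> (new_z, word_probs)."""
    z, word, output = np.zeros_like(h_T), EOS, []   # <EOS> doubles as start token
    for _ in range(max_len):
        z, probs = step_fn(z, word, h_T)
        word = int(np.argmax(probs))                # greedy: pick the best word
        if word == EOS:
            break
        output.append(word)
    return output

def toy_step(z, prev_word, h_T):
    """Scripted transitions 0 -> 3 -> 2 -> 1 -> <EOS>, for illustration."""
    probs = np.zeros(5)
    probs[{0: 3, 3: 2, 2: 1, 1: 0}[prev_word]] = 1.0
    return z, probs

print(greedy_decode(np.zeros(4), toy_step))   # [3, 2, 1]
```

Real systems usually replace the argmax with beam search, but the stopping condition on <EOS> is the same.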
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder: Training
Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
25
Encoder-Decoder: Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
26
Encoder-Decoder: Beyond text
27
Captioning: DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28
Captioning: DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
Only takes into account image features in the first hidden state.
Multimodal Recurrent Neural Network
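The note above can be made concrete with a sketch (toy sizes, plain tanh cells standing in for the actual recurrent units): the CNN image feature conditions only the first hidden state, and every later step is a pure text recurrence.

```python
import numpy as np

def caption_states(img_feat, word_embs, W_img, Wx, Wh):
    """Multimodal RNN in the DeepImageSent style: the image enters
    once, at the first hidden state; later steps only see words."""
    h = np.tanh(W_img @ img_feat)      # image features are used here only
    states = [h]
    for x in word_embs:                # afterwards: text-only recurrence
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

rng = np.random.default_rng(5)
d_img, d_emb, d_hid = 8, 4, 6          # toy dimensions
states = caption_states(rng.normal(size=d_img),
                        rng.normal(size=(3, d_emb)),   # a 3-word caption prefix
                        rng.normal(size=(d_hid, d_img)),
                        rng.normal(size=(d_hid, d_emb)),
                        rng.normal(size=(d_hid, d_hid)))
print(len(states), states[0].shape)    # 4 (6,)
```

Show & Tell (next slides) makes the same design choice: the image initializes the recurrence rather than being re-injected at every step.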
29
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
30
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
31
Captioning: LSTM for image & video
Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
Captioning (+ Detection): DenseCap
33
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
34
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
36
Captioning: HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
LSTM unit (2nd layer)
Time / Image
t = 1 … t = T
hidden state at t = T
first chunk of data
37
Visual Question Answering
[z_1, z_2, …, z_N] [y_1, y_2, …, y_M]
"Is economic growth decreasing?"
"Yes"
Encode / Encode
Decode
38
Extract visual features
Embedding
Merge / Predict answer
Question: What object is flying?
Answer: Kite
Visual Question Answering
Slide credit: Issey Masuda
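The pipeline on this slide (embed the question, extract CNN features, merge, classify the answer) can be sketched as one forward pass. Every name and size below is an illustrative assumption, not the DPPnet of the next slide:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def vqa_predict(img_feat, question_emb, W_v, W_q, W_ans):
    """Merge the image feature and question embedding into one vector,
    then classify over a fixed answer vocabulary ('kite', 'yes', ...)."""
    merged = np.tanh(W_v @ img_feat + W_q @ question_emb)   # the Merge step
    return softmax(W_ans @ merged)                          # Predict answer

rng = np.random.default_rng(2)
d_img, d_q, d_m, n_answers = 8, 6, 5, 4    # toy sizes
probs = vqa_predict(rng.normal(size=d_img), rng.normal(size=d_q),
                    rng.normal(size=(d_m, d_img)),
                    rng.normal(size=(d_m, d_q)),
                    rng.normal(size=(n_answers, d_m)))
answer_idx = int(np.argmax(probs))         # index of the predicted answer
```

Treating VQA as classification over a closed answer set, rather than free-form decoding, is the common simplification in the baselines discussed later.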
39
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local Region Feature Extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the textual space of the question q.
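The region extraction and embedding described above reduce to a reshape and a linear projection; a sketch with the slide's shapes (the textual dimension of 300 is an assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

# VGG-19 last pooling layer on a 448x448 input: a 512x14x14 feature map.
feat_map = rng.normal(size=(512, 14, 14))

# Reshape into 14*14 = 196 local region vectors of 512 dimensions each.
regions = feat_map.reshape(512, 14 * 14).T        # shape (196, 512)

d_text = 300                                      # assumed question-space size
W = rng.normal(size=(d_text, 512))                # the projection matrix W
embedded_regions = regions @ W.T                  # shape (196, 300)
print(embedded_regions.shape)
```

After the projection, each of the 196 regions lives in the same space as the question embedding, which is what lets the memory network treat regions like sentences.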
42
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets: Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets: Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge: Storytelling
Storytelling
47
Challenge: Movie Description
Movie Description, Retrieval and Fill-in-the-blank
48
Challenges: Movie Question Answering
Movie Question Answering
49
Challenges: Visual Question Answering
Visual Question Answering
50
Challenges: Visual Question Answering
VQA accuracy (%, out of 100.0):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
I. Masuda-Mora (BSc thesis): 53.62
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]
51
Summary
Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers: a great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit: Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
![Page 13: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/13.jpg)
![Page 14: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/14.jpg)
![Page 15: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/15.jpg)
![Page 16: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/16.jpg)
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
... and finally normalize to word probabilities with a softmax:
p(u_i = k | u_{<i}, h_T) = exp(e(k)) / Σ_{k'} exp(e(k'))
(e(k): score for word k; p: probability that the i-th word is word k, given the previous words and the hidden state)
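The score-then-softmax step can be sketched directly; `word_probabilities` is an illustrative name and the weights below are random placeholders:

```python
import numpy as np

def word_probabilities(z, Wout, b):
    """Score every vocabulary word against the decoder state z with a dot
    product, then normalize with a softmax: p(u_i = k) ∝ exp(w_k·z + b_k)."""
    e = Wout @ z + b     # e[k]: score for word k
    e = e - e.max()      # subtract max for numerical stability
    p = np.exp(e)
    return p / p.sum()   # probabilities over the vocabulary, summing to 1

rng = np.random.default_rng(2)
z = rng.normal(size=5)            # decoder internal state z_i
Wout = rng.normal(size=(6, 5))    # one weight row per vocabulary word
p = word_probabilities(z, Wout, np.zeros(6))
print(p.shape)  # (6,)
```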
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words for the decoded sentence are generated until an <EOS> (End of Sentence) "word" is predicted.
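The whole generation loop can be sketched end to end. This is a toy greedy decoder under assumed tanh-RNN dynamics: `greedy_decode`, the parameter layout, and the start-symbol convention are illustrative simplifications (Cho's model uses a GRU, and beam search is common in practice).

```python
import numpy as np

EOS = 0  # index of the <EOS> token in this toy vocabulary

def greedy_decode(h_sent, params, max_len=10):
    """Generate words until <EOS>: each step updates z_i from the sentence
    embedding, the previous word, and the previous state, scores the
    vocabulary, and picks the most probable word (greedy search)."""
    E, Wz, Uz, Cz, Wout = params
    z = np.zeros(Uz.shape[0])
    u = EOS                       # conventional start symbol
    out = []
    for _ in range(max_len):
        # z_i = f(u_{i-1}, z_{i-1}, h_T)
        z = np.tanh(E[u] @ Wz + z @ Uz + h_sent @ Cz)
        u = int(np.argmax(Wout @ z))   # highest-scoring word
        if u == EOS:
            break
        out.append(u)
    return out

rng = np.random.default_rng(4)
K, d, h = 6, 4, 5
params = (rng.normal(size=(K, d)), rng.normal(size=(d, h)),
          rng.normal(size=(h, h)), rng.normal(size=(h, h)),
          rng.normal(size=(K, h)))
decoded = greedy_decode(rng.normal(size=h), params)
print(decoded)
```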
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder Training
Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to Sequence Learning with Neural Networks." NIPS 2014.
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep Visual-Semantic Alignments for Generating Image Descriptions." CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep Visual-Semantic Alignments for Generating Image Descriptions." CVPR 2015.
It only takes into account image features in the first hidden state.
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and Tell: A Neural Image Caption Generator." CVPR 2015.
30
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and Tell: A Neural Image Caption Generator." CVPR 2015.
31
Captioning LSTM for image amp video
Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016.
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016.
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully Convolutional Localization Networks for Dense Captioning." CVPR 2016.
36
Captioning HRNE
(Slides by Marc Bolaños) Pan, Pingbo, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
[Figure: a second-layer LSTM unit runs over image features from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data.]
37
Visual Question Answering
[z1, z2, ..., zN] [y1, y2, ..., yM]
"Is economic growth decreasing?"
"Yes"
Encode / Encode
Decode
38
Extract visual features → Embedding → Merge → Predict answer
Question: "What object is flying?"
Answer: "Kite"
Visual Question Answering
Slide credit Issey Masuda
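The merge-based pipeline on this slide can be sketched as follows. This is a minimal illustration, not the exact model of any cited paper: `vqa_predict`, the element-wise-product merge (one common choice; concatenation or sum are also used), and all weights are assumptions.

```python
import numpy as np

def vqa_predict(img_feat, q_emb, Wi, Wq, Wa):
    """Minimal VQA sketch: embed both modalities, merge, predict an answer.
    Wi, Wq project image and question into a shared space; Wa is a linear
    classifier over a fixed set of candidate answers."""
    v = np.tanh(Wi @ img_feat)   # visual embedding
    q = np.tanh(Wq @ q_emb)      # question embedding
    m = v * q                    # merge (element-wise product)
    scores = Wa @ m              # one score per candidate answer
    return int(np.argmax(scores))  # e.g. the index of the answer "kite"

rng = np.random.default_rng(5)
Dv, Dq, h, A = 8, 6, 5, 4        # toy dimensions and 4 candidate answers
ans = vqa_predict(rng.normal(size=Dv), rng.normal(size=Dq),
                  rng.normal(size=(h, Dv)), rng.normal(size=(h, Dq)),
                  rng.normal(size=(A, h)))
print(ans)
```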
39
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016 (arXiv:1603.01417).
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and treat each region as the equivalent of a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the textual ("q") space.
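The region extraction and projection step reduces to two array operations. The shapes follow the slide; the random features and the projection matrix W are placeholders for the learned quantities:

```python
import numpy as np

# VGG-19 last pooling layer on a 448x448 input: 512 channels on a 14x14 grid
feat = np.random.default_rng(3).normal(size=(512, 14, 14))

# 196 local region vectors of dimension 512, one per grid cell
regions = feat.reshape(512, 14 * 14).T          # (196, 512)

# Project each region into the textual ("q") space with a learned matrix W
# (here random; its output dimension of 100 is an illustrative choice)
W = np.random.default_rng(4).normal(size=(512, 100)) * 0.01
q_space = regions @ W                           # (196, 100)
print(regions.shape, q_space.shape)  # (196, 512) (196, 100)
```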
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision using Crowdsourced Dense Image Annotations." arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description, Retrieval and Fill-in-the-Blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Challenge results (accuracy, %):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline (LSTM & CNN): 54.06
Baseline (Nearest neighbor): 42.85
Baseline (Prior per question type): 37.47
Baseline (All "yes"): 29.88
I. Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into shared semantic embeddings allows fusion learning.
Very high interest among researchers: a great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate an AI's image understanding?
Slide credit Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
![Page 17: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/17.jpg)
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 18: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/18.jpg)
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 19: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/19.jpg)
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 20: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/20.jpg)
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Accuracy (%), out of 100.0:
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering," submitted as BSc ETSETB thesis: 53.62 [clean code in Keras, perfect for beginners]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion and learning.
Very high interest among researchers: a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit Issey Masuda
53
Thanks! Q&A. Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi
![Page 21: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/21.jpg)
21
Decoder
Bridle, John S., "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters," NIPS 1989.
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the i-th word is word k
Previous words / Hidden state
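The annotations above are the labels of the decoder's softmax equation, which did not survive extraction. A reconstruction in the deck's notation (K = vocabulary size, e_k = the score for word k, computed from the previous words and the hidden state):

```latex
p(w_i = k \mid w_{<i}, h_i) = \frac{\exp(e_k)}{\sum_{j=1}^{K} \exp(e_j)}
```

This normalizes the K word scores into a probability distribution over the vocabulary, from which the i-th word is chosen.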
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until an <EOS> (End of Sentence) "word" is predicted.
EOS
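The generate-until-<EOS> loop can be sketched with a toy greedy decoder; the recurrence, sizes, and the random weights are illustrative stand-ins, not the lecture's actual model:

```python
import numpy as np

rng = np.random.default_rng(2)
EOS = 0                      # index of the <EOS> token (assumption)
K, H = 50, 16                # toy vocabulary size and hidden size

W_h = rng.standard_normal((H, H)) * 0.1   # hidden-to-hidden weights
W_e = rng.standard_normal((H, K)) * 0.1   # column k = embedding of word k
W_o = rng.standard_normal((K, H)) * 0.1   # hidden-to-vocabulary scores

def decode(h, max_len=20):
    """Greedily emit words until <EOS> is predicted (or max_len is hit)."""
    words = []
    for _ in range(max_len):
        scores = W_o @ h                  # score e_k for every word k
        w = int(scores.argmax())          # greedy choice of the next word
        words.append(w)
        if w == EOS:                      # stop once <EOS> is generated
            break
        h = np.tanh(W_h @ h + W_e[:, w])  # feed the prediction back in
    return words

out = decode(rng.standard_normal(H))
print(out)
```

In practice the argmax would be replaced by sampling or beam search, but the stopping criterion is the same: generation ends when <EOS> is emitted.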
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder Training: dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," NIPS 2014.
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," CVPR 2015.
Only takes into account image features in the first hidden state.
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," CVPR 2015.
30
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," CVPR 2015.
31
Captioning LSTM for image amp video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell, "Long-term Recurrent Convolutional Networks for Visual Recognition and Description," CVPR 2015 [code].
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," CVPR 2016.
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," CVPR 2016.
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," CVPR 2016.
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang, "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning," CVPR 2016.
[Figure: an LSTM unit (2nd layer) reads, over time, the hidden state at t = T produced by a first-layer LSTM over each chunk of image data (t = 1 … t = T).]
37
Visual Question Answering
[z1, z2, …, zN] [y1, y2, …, yM]
"Is economic growth decreasing?"
"Yes"
Encode / Encode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 22: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/22.jpg)
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 23: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/23.jpg)
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 24: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/24.jpg)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 25: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/25.jpg)
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 26: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/26.jpg)
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. ICML 2016
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the textual space of the question "q".
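The region-extraction arithmetic can be checked directly: flattening a 512×14×14 feature map yields 196 spatial cells of 512 dimensions each, which a learned matrix W projects into the textual space (here W is random and the 300-d target size is a made-up stand-in, not the paper's value).

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in for the VGG-19 last-pooling output on a 448x448 input: 512 x 14 x 14
feat = rng.normal(size=(512, 14, 14))

# 196 local region vectors of dimension 512 (one per 14x14 spatial cell)
regions = feat.reshape(512, 196).T          # shape (196, 512)

# project into the textual ("q") space with a matrix W (random here)
d_text = 300                                # hypothetical textual dimension
W = rng.normal(size=(512, d_text))
regions_txt = regions @ W                   # shape (196, 300)
```

Each row of `regions` is the channel vector of one spatial location, so the image becomes a "document" of 196 region "sentences".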
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
[Chart: accuracy (%) on the Visual Question Answering challenge]
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
Masuda-Mora (below): 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc thesis at ETSETB [clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion and learning across modalities.
Very high interest among researchers: a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate an AI's image understanding?
Slide credit Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
![Page 27: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/27.jpg)
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. CVPR 2015
30
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. CVPR 2015
![Page 28: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/28.jpg)
![Page 29: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/29.jpg)
![Page 30: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/30.jpg)
![Page 31: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/31.jpg)
![Page 32: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/32.jpg)
![Page 33: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/33.jpg)
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description, Retrieval and Fill-in-the-Blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Accuracy on the VQA challenge (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
I. Masuda-Mora: 53.62
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All "yes": 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers. Great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate an AI's image understanding?
Slide credit Issey Masuda
53
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
![Page 34: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/34.jpg)
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 35: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/35.jpg)
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 36: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/36.jpg)
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 37: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/37.jpg)
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 38: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/38.jpg)
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 39: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/39.jpg)
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 40: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/40.jpg)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 41: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/41.jpg)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 42: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/42.jpg)
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 43: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/43.jpg)
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 44: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/44.jpg)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 45: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/45.jpg)
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
![Page 46: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/46.jpg)
![Page 47: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/47.jpg)
![Page 48: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/48.jpg)
![Page 49: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/49.jpg)
![Page 50: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/50.jpg)
![Page 51: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/51.jpg)
![Page 52: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/52.jpg)
![Page 53: Language and Vision Day 4 Lecture 3](https://reader031.vdocument.in/reader031/viewer/2022021717/620bddebb165de6bb0624e13/html5/thumbnails/53.jpg)