Day 4 Lecture 3 Language and Vision Xavier Giró-i-Nieto


2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Ñeco, R.P. and Forcada, M.L. "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks, Vol. 4, pp. 2535-2540. IEEE, June 1997.

6

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

Representation or Embedding

For clarity, let's study a Neural Machine Translation (NMT) case.

7

Encoder One-hot encoding

One-hot encoding: a binary representation of the words in a vocabulary in which only combinations with a single hot (1) bit and all other bits cold (0) are allowed.

Word   Binary   One-hot encoding
zero   00       0001
one    01       0010
two    10       0100
three  11       1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded as vectors of dimensionality equal to the size of the dictionary (K).

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation: every word is equidistant from every other word.

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
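As a sketch (plain NumPy, with a hypothetical four-word vocabulary), one-hot encoding and the equidistance property look like this:

```python
import numpy as np

# Hypothetical vocabulary of K = 4 words.
vocab = ["economic", "growth", "has", "slowed"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, K=len(vocab)):
    """Return a K-dimensional vector with a single hot (1) bit."""
    v = np.zeros(K)
    v[word_to_idx[word]] = 1.0
    return v

# Every pair of distinct words is at the same Euclidean distance, sqrt(2):
a, b = one_hot("economic"), one_hot("growth")
print(np.linalg.norm(a - b))  # 1.4142...
```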

10

Encoder Projection to continuous space

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The one-hot vector w_i (dimension K) is linearly projected to a space of lower dimension (typically 100-500) as s_i = E w_i, where E is a matrix of learned weights.

11

Encoder Projection to continuous space

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The projection matrix E corresponds to a fully connected layer, so its parameters are learned during training.
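A minimal sketch (NumPy, made-up dimensions) of the projection: multiplying E by a one-hot vector simply selects one column of E, which is why the fully connected layer behaves like a learnable lookup table.

```python
import numpy as np

K, M = 6, 3                    # vocabulary size, embedding dimension (toy values)
rng = np.random.default_rng(0)
E = rng.normal(size=(M, K))    # projection matrix (random here; learned in practice)

w = np.zeros(K)
w[2] = 1.0                     # one-hot vector for word index 2
s = E @ w                      # continuous representation: s_i = E w_i

# The matrix product equals a plain column lookup:
assert np.allclose(s, E[:, 2])
```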

12

Encoder Projection to continuous space

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

A sequence of words becomes a sequence of continuous-space word representations.

13

Encoder Recurrence

Sequence

Figure: Christopher Olah, "Understanding LSTM Networks" (2015)
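The recurrence can be sketched as a vanilla RNN cell (toy sizes; real encoders typically use GRU or LSTM units instead):

```python
import numpy as np

M, H = 4, 5                          # input and hidden sizes (toy values)
rng = np.random.default_rng(1)
W = rng.normal(size=(H, H))          # recurrent weights
U = rng.normal(size=(H, M))          # input weights

def rnn_step(h_prev, x):
    """h_t = tanh(W h_{t-1} + U x_t): the state summarizes the prefix read so far."""
    return np.tanh(W @ h_prev + U @ x)

h = np.zeros(H)
for x in rng.normal(size=(3, M)):    # a 3-word sequence of word embeddings
    h = rnn_step(h, x)
# The final h is a fixed-size summary of the whole sequence.
print(h.shape)  # (5,)
```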

14

Encoder Recurrence

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

15

Encoder Recurrence

The recurrence unrolled over time, shown in front view and in side view (rotated 90°).

16

Encoder Recurrence

The side view (front view rotated 90°) highlights the final hidden state: the representation or embedding of the sentence.

17

Sentence Embedding

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.
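A simplified sketch of that update (toy NumPy, a gating-free cell of my own choosing; real decoders use GRU/LSTM units):

```python
import numpy as np

H, M = 5, 4                       # hidden and embedding sizes (toy values)
rng = np.random.default_rng(2)
W = rng.normal(size=(H, H))       # weights for the previous state z_{i-1}
U = rng.normal(size=(H, M))       # weights for the previous word embedding
C = rng.normal(size=(H, H))       # weights for the sentence embedding h

def decoder_step(z_prev, e_prev, h):
    """z_i = tanh(W z_{i-1} + U e(u_{i-1}) + C h): simplified state update."""
    return np.tanh(W @ z_prev + U @ e_prev + C @ h)

h = rng.normal(size=H)            # sentence embedding from the encoder
z = decoder_step(np.zeros(H), np.zeros(M), h)
print(z.shape)  # (5,)
```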

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights w_k for word k:

e(k) = w_k^T z_i

21

Decoder

Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.

...and finally normalize the scores to word probabilities with a softmax. The probability that the i-th word is word k, given the previous words and the hidden state, is

p(w_i = k | w_1, ..., w_{i-1}, h) = exp(e(k)) / Σ_j exp(e(j))
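A sketch of the scoring and softmax steps together (NumPy, toy sizes):

```python
import numpy as np

def softmax(e):
    e = e - e.max()               # subtract the max for numerical stability
    exp = np.exp(e)
    return exp / exp.sum()

K, H = 6, 5                       # vocabulary and hidden sizes (toy values)
rng = np.random.default_rng(3)
Wout = rng.normal(size=(K, H))    # one weight vector w_k per vocabulary word
z = rng.normal(size=H)            # decoder internal state z_i

scores = Wout @ z                 # e(k) = w_k . z_i for every word k at once
p = softmax(scores)               # p(w_i = k | ...) over the whole vocabulary
print(p.sum())                    # 1.0
```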

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
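The generation loop can be sketched like this; the `next_word_probs` function below is a stand-in stub, not the real network:

```python
import numpy as np

vocab = ["<EOS>", "la", "croissance", "économique", "a", "ralenti"]

def next_word_probs(prev_word, state):
    """Stub for p(w_i | w_{<i}, h): emits the sentence words, then <EOS>."""
    p = np.zeros(len(vocab))
    p[(state + 1) % len(vocab)] = 1.0
    return p

def greedy_decode(max_len=10):
    words, prev, state = [], None, 0
    for _ in range(max_len):
        probs = next_word_probs(prev, state)
        state = int(np.argmax(probs))      # greedily pick the most probable word
        prev = vocab[state]
        if prev == "<EOS>":                # stop once <EOS> is predicted
            break
        words.append(prev)
    return words

print(greedy_decode())  # ['la', 'croissance', 'économique', 'a', 'ralenti']
```

In practice, beam search is usually preferred over this purely greedy choice.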

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder Training

Dataset: pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
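Training maximizes the probability of each target sentence given its source. A sketch of the per-pair loss, assuming the decoder already returns one probability distribution per step (teacher forcing: the true previous word is fed in at each step):

```python
import numpy as np

def sentence_loss(step_probs, target_ids):
    """Negative log-likelihood of the target words under the decoder's
    per-step distributions over the vocabulary."""
    return -sum(np.log(p[t]) for p, t in zip(step_probs, target_ids))

# Toy example: 3 decoding steps over a 4-word vocabulary.
probs = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.1, 0.8, 0.05, 0.05]),
         np.array([0.25, 0.25, 0.25, 0.25])]
target = [0, 1, 3]
print(round(sentence_loss(probs, target), 3))  # 1.966 = -(ln 0.7 + ln 0.8 + ln 0.25)
```

Summing this loss over the whole parallel corpus and minimizing it by gradient descent trains encoder and decoder jointly.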

25

Encoder-Decoder Seq2Seq

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.

The Multimodal Recurrent Neural Network only takes image features into account in the first hidden state.

29

Captioning Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

30

Captioning Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.

31

Captioning LSTM for image & video

Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code].

32

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

34

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.

Figure: an LSTM unit (2nd layer) runs over time from t = 1 to t = T; the hidden state at t = T summarizes the first chunk of data.

37

Visual Question Answering

[z_1, z_2, …, z_N] → [y_1, y_2, …, y_M]

"Is economic growth decreasing?" → "Yes"

Encode → Decode

38

Extract visual features → Embedding → Merge → Predict answer

Question: "What object is flying?" → Answer: "Kite"

Visual Question Answering

Slide credit Issey Masuda
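A sketch of that merge step with toy feature sizes (real systems use a CNN for the image features and an RNN for the question embedding; concatenation is only the simplest possible merge):

```python
import numpy as np

rng = np.random.default_rng(4)
D_img, D_q, n_answers = 8, 6, 10     # toy dimensions

img_feat = rng.normal(size=D_img)    # stand-in for CNN image features
q_feat = rng.normal(size=D_q)        # stand-in for the question embedding

merged = np.concatenate([img_feat, q_feat])  # simplest merge: concatenation

W = rng.normal(size=(n_answers, merged.size))
scores = W @ merged                  # classify over a fixed set of answers
answer_id = int(np.argmax(scores))   # e.g. the index of "kite"
print(answer_id)
```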

39

Visual Question Answering

Noh, H., Seo, P.H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features into the "q" textual space.
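The region extraction step is just a reshape of the convolutional feature map; a sketch with the stated dimensions:

```python
import numpy as np

# Stand-in for VGG-19's last pooling output on a 448x448 input: 512 x 14 x 14.
feat_map = np.random.default_rng(5).normal(size=(512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of 512 dims.
regions = feat_map.reshape(512, 14 * 14).T
print(regions.shape)  # (196, 512)
```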

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

VQA accuracy (%):

Humans                              83.30
UC Berkeley & Sony                  66.47
Baseline: LSTM & CNN                54.06
Baseline: Nearest neighbor          42.85
Baseline: Prior per question type   37.47
Baseline: All "yes"                 29.88
I. Masuda-Mora (thesis below)       53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners.]

Challenges Visual Question Answering

51

Summary

Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers: a great topic for your thesis.

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 2: Language and Vision Day 4 Lecture 3

2

Acknowledgments

Santi Pascual

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 3: Language and Vision Day 4 Lecture 3

3

In lecture D2L6 RNNs

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)

Language IN

Language OUT

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 4: Language and Vision Day 4 Lecture 3

4

Motivation

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights w_k for word k:

e(k) = w_k^T z_i + b_k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax:

p(w_i = k | w_{<i}, h_T) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k, and the probability of the i-th word being word k is conditioned on the previous words and the hidden state.
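The two decoder steps above, a dot-product score per vocabulary word followed by a softmax, can be sketched as (dimensions are illustrative):

```python
import numpy as np

def word_probabilities(z, W, b):
    """Score every vocabulary word against decoder state z, then softmax."""
    scores = W @ z + b            # e(k) = w_k . z + b_k, for all k at once
    scores -= scores.max()        # stabilize the exponentials
    p = np.exp(scores)
    return p / p.sum()            # p(i-th word = k | previous words, h_T)

rng = np.random.default_rng(0)
K, d = 6, 4                       # vocabulary size, decoder state dimension
W = rng.standard_normal((K, d))   # one weight row per vocabulary word
b = rng.standard_normal(K)
p = word_probabilities(rng.standard_normal(d), W, b)
```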

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted
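Generation can then repeat the scoring step greedily until the <EOS> token wins. A sketch with a toy state update (a real decoder would also feed back the embedded previous word, and the EOS index is an assumption):

```python
import numpy as np

EOS = 0  # assumed index of the <EOS> "word"

def greedy_decode(z, W, b, W_h, max_len=10):
    """Emit the argmax word at each step until <EOS> is predicted."""
    words = []
    for _ in range(max_len):
        k = int(np.argmax(W @ z + b))   # highest-scoring vocabulary word
        if k == EOS:
            break                       # stop once <EOS> is generated
        words.append(k)
        z = np.tanh(W_h @ z)            # toy internal-state update
    return words

rng = np.random.default_rng(1)
K, d = 6, 4
out = greedy_decode(rng.standard_normal(d),
                    rng.standard_normal((K, d)),
                    rng.standard_normal(K),
                    rng.standard_normal((d, d)))
```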

23

Encoder-Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

24

Encoder-Decoder Training

Dataset of pairs of sentences in the two languages to translate

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network
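The criticism on the slide, that the image enters only through the first hidden state, can be illustrated with a sketch; W_img and all dimensions here are assumptions, and a real system would use CNN features and an LSTM:

```python
import numpy as np

def caption_init_state(cnn_feature, W_img):
    """Project CNN image features into the RNN's initial hidden state h_0."""
    return np.tanh(W_img @ cnn_feature)

rng = np.random.default_rng(0)
d_img, d = 8, 5                          # image feature and hidden dims
W_img = rng.standard_normal((d, d_img))
h0 = caption_init_state(rng.standard_normal(d_img), W_img)
# From here on, the image is visible only through h0: h_t = f(x_t, h_{t-1}).
```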

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016

[Figure: a first-layer LSTM encodes the image sequence over time from t = 1 to t = T; its hidden state at t = T for the first chunk of data feeds an LSTM unit in the 2nd layer]

37

Visual Question Answering

[Figure: the image and the question "Is economic growth decreasing?" are each encoded into sequences [z1, z2, …, zN] and [y1, y2, …, yM], which are then decoded into the answer "Yes"]

38

[Figure: extract visual features from the image, embed the question ("What object is flying?"), merge both representations, and predict the answer ("Kite")]
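The pipeline described on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched as follows; the pointwise-product merge and all dimensions are assumptions for illustration, not the slide's specific choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def vqa_predict(img_feat, q_feat, W_img, W_q, W_ans):
    """Embed both modalities, merge by pointwise product, classify answer."""
    v = np.tanh(W_img @ img_feat)     # visual embedding
    q = np.tanh(W_q @ q_feat)         # question embedding
    merged = v * q                    # merge step (one common choice)
    return softmax(W_ans @ merged)    # distribution over candidate answers

rng = np.random.default_rng(0)
d_img, d_q, d, n_answers = 8, 6, 5, 10
probs = vqa_predict(rng.standard_normal(d_img), rng.standard_normal(d_q),
                    rng.standard_normal((d, d_img)),
                    rng.standard_normal((d, d_q)),
                    rng.standard_normal((n_answers, d)))
```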

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale input to 448x448. (2) Take output from last pooling layer → D = 512x14x14 → 196 512-d local region vectors.

Visual feature embedding: matrix W projects image features to the textual ("q") space.
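The feature extraction above (a 512x14x14 VGG-19 pooling output reshaped into 196 region vectors, then projected into the textual space) can be sketched as; the textual dimension of 300 is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((512, 14, 14))   # last pooling layer, VGG-19

# 14 x 14 spatial grid -> 196 local region vectors of dimension 512
regions = features.reshape(512, -1).T           # shape (196, 512)

# Project each region into the textual ("q") space with a learned matrix W
d_text = 300                                    # assumed embedding size
W = rng.standard_normal((d_text, 512))
embedded = regions @ W.T                        # one 300-d vector per region
```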

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Accuracy (%) on the Visual Question Answering challenge:

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering": 53.62. Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary

- Embedding language and vision into semantic embeddings allows fusion and learning.
- Very high interest among researchers. Great topic for your thesis!
- Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 5: Language and Vision Day 4 Lecture 3

5

Much earlier than lecture D2L6 RNNs

Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 6: Language and Vision Day 4 Lecture 3

6

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Representation or Embedding

For clarity letrsquos study a Neural Machine Translation (NMT) case

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 7: Language and Vision Day 4 Lecture 3

7

Encoder One-hot encoding

One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed

Word Binary One-hot encoding

zero 00 0000

one 01 0010

two 10 0100

three 11 1000

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and treat each region as equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
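The region extraction described above can be sketched in a few lines; the projection matrix W is a placeholder for learned weights, and the 300-d textual space is an assumed dimensionality (the slide does not specify it):

```python
# Sketch of DMN-style local region extraction: reshape a CNN feature map
# of shape 512x14x14 into 196 region vectors of 512 dimensions, then
# project each region into the textual ("q") space with a matrix W.
import numpy as np

D, H, Wd = 512, 14, 14          # VGG-19 last pooling output for a 448x448 input
q_dim = 300                     # assumed dimensionality of the textual space

feat_map = np.random.randn(D, H, Wd)        # CNN feature map
regions = feat_map.reshape(D, H * Wd).T     # 196 local region vectors, 512-d each
W = np.random.randn(D, q_dim)               # learned projection (random here)
regions_q = np.tanh(regions @ W)            # regions embedded in the "q" space

print(regions.shape, regions_q.shape)       # (196, 512) (196, 300)
```

Each of the 196 rows then plays the role of a "sentence" for the memory network's attention mechanism.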

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

Results on the VQA challenge (accuracy, %):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline (LSTM & CNN): 54.06
I. Masuda-Mora: 53.62
Baseline (Nearest neighbor): 42.85
Baseline (Prior per question type): 37.47
Baseline (All yes): 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: how to evaluate an AI's image understanding?

Slide credit Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 8: Language and Vision Day 4 Lecture 3

8

Encoder One-hot encoding

Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)

Word One-hot encoding

economic 000010

growth 001000

has 100000

slowed 000001

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 9: Language and Vision Day 4 Lecture 3

Encoder One-hot encoding

One-hot is a very simple representation every word is equidistant from every other word

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 10: Language and Vision Day 4 Lecture 3

10

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM WiE

The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights

K

K

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 11: Language and Vision Day 4 Lecture 3

11

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

siM Wi

Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process

K

12

Encoder Projection to continious space

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

Sequence of continious-space

word representations

Sequence of words

13

Encoder Recurrence

Sequence

Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)

14

Encoder Recurrence

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

(Diagram: two encoders map the inputs to sequences [z1, z2, …, zN] and [y1, y2, …, yM]; a decoder produces the answer "Yes" to the question "Is economic growth decreasing?")

38

(Diagram: extract visual features from the image; embed the question "What object is flying?"; merge both representations; predict the answer: "Kite")

Visual Question Answering

Slide credit: Issey Masuda
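The extract/embed/merge/predict pipeline above can be sketched as a tiny classifier over a fixed answer vocabulary. The merge here is an elementwise product of the two projected representations (concatenation or sum are equally common choices); the answer list, dimensions, and weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
img_dim, q_dim, joint, n_answers = 6, 5, 4, 3
answers = ["kite", "dog", "car"]  # toy answer vocabulary

W_img = rng.normal(size=(joint, img_dim))   # visual embedding
W_q   = rng.normal(size=(joint, q_dim))     # question embedding
W_cls = rng.normal(size=(n_answers, joint)) # answer classifier

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer(img_feat, q_feat):
    # Merge by elementwise product of the projected image and question.
    merged = np.tanh(W_img @ img_feat) * np.tanh(W_q @ q_feat)
    return answers[int(softmax(W_cls @ merged).argmax())]

print(answer(rng.normal(size=img_dim), rng.normal(size=q_dim)))
```

Framing VQA as classification over the top few thousand answers, rather than free-form decoding, is what most of the baselines on the later slides do.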

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. Dynamic Memory Networks for Visual and Textual Question Answering. arXiv preprint arXiv:1603.01417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
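The region extraction and projection just described amount to a reshape and a matrix multiply; a sketch (the 300-d textual dimension and the random W are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Last pooling layer of VGG-19 on a 448x448 input: 512 x 14 x 14.
fmap = rng.normal(size=(512, 14, 14))

# Each of the 14*14 = 196 spatial positions becomes a 512-d "local region" vector.
regions = fmap.reshape(512, 196).T      # shape (196, 512)

# Project every region into the textual ("q") embedding space with a learned W.
txt_dim = 300                           # assumed textual embedding size
W = rng.normal(size=(txt_dim, 512))
embedded = regions @ W.T                # shape (196, 300)
print(regions.shape, embedded.shape)
```

Once the 196 regions live in the same space as the question embedding, the memory network can attend over them exactly as it attends over sentences in the textual setting.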

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded Question Answering in Images. CVPR 2016

43

Datasets Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description, Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

(Chart: VQA challenge accuracy, %)

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora: 53.62

I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion learning.

Very high interest among researchers. Great topic for your thesis!

Will vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit: Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 15: Language and Vision Day 4 Lecture 3

15

Encoder Recurrence

time

time

Front View Side View

Rotation 90o

16

Encoder RecurrenceFront View

Rotation 90o

Side View

Representation or embedding of the sentence

17

Sentence Embedding

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states

18

(Word Embeddings)

Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013

19

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary: Embedding language and vision into semantic embeddings allows fusion and learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit: Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

DocXavi / ProfessorXavi

Page 16: Language and Vision Day 4 Lecture 3

16

Encoder Recurrence (diagram shown in front view and, rotated 90°, in side view)

The final hidden state is the representation or embedding of the sentence.
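A minimal sketch of this encoder recurrence (a plain tanh RNN standing in for the GRU/LSTM units used in practice; all sizes and weights here are illustrative):

```python
import numpy as np

def encode(one_hot_words, E, W, U):
    """Toy RNN encoder: the last hidden state is the sentence embedding."""
    h = np.zeros(U.shape[0])
    for w in one_hot_words:
        s = E @ w                    # project one-hot word to continuous space
        h = np.tanh(W @ s + U @ h)   # recurrence over the sentence
    return h                         # representation/embedding of the sentence

K, d_emb, d_h = 6, 4, 5              # toy vocabulary, embedding, and state sizes
E = np.random.randn(d_emb, K)
W = np.random.randn(d_h, d_emb)
U = np.random.randn(d_h, d_h)
sentence = np.eye(K)[[2, 0, 5]]      # three one-hot encoded words
embedding = encode(sentence, E, W, U)
```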

17

Sentence Embedding

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NIPS 2014

Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.

18

(Word Embeddings)

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013

19

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

The RNN's internal state zi depends on the sentence embedding ht, the previous word ui-1, and the previous internal state zi-1.
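As a sketch, that state update can be written as follows (a toy tanh update standing in for the gated recurrent unit actually used; all weights are illustrative):

```python
import numpy as np

def decoder_state(z_prev, u_prev, h, Wz, Wu, Wh):
    # z_i = f(z_{i-1}, u_{i-1}, h): toy tanh update standing in
    # for the gated unit used in the actual model
    return np.tanh(Wz @ z_prev + Wu @ u_prev + Wh @ h)

d = 4
Wz, Wu, Wh = (np.random.randn(d, d) for _ in range(3))
z = decoder_state(np.zeros(d), np.zeros(d), np.ones(d), Wz, Wu, Wh)
```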

20

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

With zi ready, we can score each word k in the vocabulary with a dot product:

(Diagram labels: RNN internal state; neuron weights for word k.)

21

Decoder

Bridle, John S. Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters. NIPS 1989

and finally normalize to word probabilities with a softmax

p(word i = k | previous words, hidden state) = exp(score_k) / Σj exp(score_j), where score_k = wk · zi is the score for word k.
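The dot-product scoring and softmax normalization can be sketched as follows (toy sizes; random weights stand in for learned parameters):

```python
import numpy as np

def word_probs(z, Wout):
    """Dot-product scores for every vocabulary word, softmax-normalized."""
    scores = Wout @ z                  # one score w_k . z_i per word k
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()                 # probability that word i is word k

probs = word_probs(np.random.randn(4), np.random.randn(10, 4))
```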

22

Decoder

Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)

More words of the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.

EOS
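Put together, greedy decoding until <EOS> looks like this minimal sketch (a toy tanh state update and random weights stand in for the trained decoder; the EOS index is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                       # toy vocabulary and state sizes
EOS = 0                           # assumed index of the <EOS> token
Wout = rng.normal(size=(V, d))    # output weights, one row per word
Wz = rng.normal(size=(d, d))      # toy state-update weights

z = rng.normal(size=d)            # initial state from the sentence embedding
decoded = []
for _ in range(20):               # hard cap in case <EOS> is never chosen
    k = int(np.argmax(Wout @ z))  # greedy pick: highest-scoring word
    if k == EOS:
        break
    decoded.append(k)
    z = np.tanh(Wz @ z)           # advance the state for the next word
```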

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder Training: a dataset of pairs of sentences in the two languages to translate.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR 2015

only takes into account image features in the first hidden state

Multimodal Recurrent Neural Network

29

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CVPR 2015

30

Captioning: Show & Tell

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CVPR 2015

31

Captioning: LSTM for image & video

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015 [code]

32

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. CVPR 2016

XAVI: "man has short hair", "man with short hair"

AMAIA: "a woman wearing a black shirt"

BOTH: "two men wearing black glasses"

35

Captioning (+ Retrieval) DenseCap

Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. CVPR 2016

36

Captioning HRNE

(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. CVPR 2016

(Figure labels: LSTM unit (2nd layer); time; image; t = 1 … t = T; hidden state at t = T; first chunk of data.)

37

Visual Question Answering

The question "Is economic growth decreasing?" and the image are each encoded, into [z1, z2, …, zN] and [y1, y2, …, yM]; the decoder then produces the answer: "Yes".

38

Pipeline: extract visual features → embedding; merge with the encoded question; predict the answer.

Question: "What object is flying?" Answer: "Kite"

Visual Question Answering

Slide credit: Issey Masuda
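A minimal sketch of this merge-and-predict pipeline (element-wise fusion is one common choice; all dimensions and weights here are illustrative, not the paper's actual model):

```python
import numpy as np

def answer(img_feat, q_emb, Wi, Wq, Wa):
    """Embed both modalities into a common space, fuse, score answers."""
    v = np.tanh(Wi @ img_feat)     # embedded visual features
    q = np.tanh(Wq @ q_emb)        # embedded question
    scores = Wa @ (v * q)          # element-wise merge, then answer scores
    return int(np.argmax(scores))  # index of the predicted answer

d, n_ans = 6, 5
Wi, Wq = np.random.randn(d, 8), np.random.randn(d, 7)
Wa = np.random.randn(n_ans, d)
pred = answer(np.random.randn(8), np.random.randn(7), Wi, Wq, Wa)
```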

Page 20: Language and Vision Day 4 Lecture 3

20

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

With zi ready we can score each word k in the vocabulary with a dot product

RNN internal

state

Neuron weights for

word k

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

[Figure: VQA pipeline. The image goes through visual feature extraction and the question ("What object is flying?") through an embedding; the two are merged to predict the answer: "Kite".]

Visual Question Answering

Slide credit Issey Masuda
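The extract–embed–merge–predict pipeline on this slide can be sketched as a tiny classifier over candidate answers. The merge-by-concatenation, the answer vocabulary, and all sizes and weights are illustrative assumptions, not a specific system:

```python
import numpy as np

rng = np.random.default_rng(0)
D_img, D_q, n_answers = 16, 8, 4          # toy sizes; real systems use CNN/RNN features
answers = ["kite", "dog", "ball", "car"]  # hypothetical answer vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Merge by concatenation, then a linear classifier over the candidate answers
W_cls = rng.normal(size=(n_answers, D_img + D_q)) * 0.1

def answer_question(img_feat, q_embed):
    merged = np.concatenate([img_feat, q_embed])   # the "merge" step
    probs = softmax(W_cls @ merged)                # distribution over answers
    return answers[int(np.argmax(probs))], probs

ans, probs = answer_question(rng.normal(size=D_img), rng.normal(size=D_q))
```

This framing turns open-ended VQA into classification over a fixed answer list, which is how many early VQA baselines were trained.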

39

Visual Question Answering

Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering: Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).

41

Visual Question Answering: Dynamic

(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.

Main idea: split the image into local regions and consider each region equivalent to a sentence.

Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of dimension 512.

Visual feature embedding: a matrix W projects the image features to the "q" textual space.
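The region-extraction arithmetic above — a 512×14×14 feature map flattened into 196 region vectors of size 512, then projected into the textual space — reduces to a reshape and a matrix product. A minimal NumPy sketch, where the 300-d textual dimension and the random projection are assumptions for illustration:

```python
import numpy as np

# Shapes from the slide: VGG-19 last-pooling output for a 448x448 input
D, H, W_sp = 512, 14, 14      # channels x spatial grid
d_text = 300                  # assumed dimensionality of the textual "q" space

feature_map = np.random.default_rng(1).normal(size=(D, H, W_sp))

# Flatten the 14x14 grid into 196 local region vectors of size 512
regions = feature_map.reshape(D, H * W_sp).T          # shape (196, 512)

# Learned projection W into the question-embedding space; random stand-in here
W_proj = np.random.default_rng(2).normal(size=(D, d_text)) * 0.01
region_embeddings = np.tanh(regions @ W_proj)         # shape (196, 300)
```

Each of the 196 rows then plays the role of a "sentence" for the memory network's attention over the image.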

42

Visual Question Answering: Grounded

(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

43

Datasets: Visual Genome

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).

44

Datasets: Microsoft SIND

Microsoft SIND

45

Challenge: Microsoft COCO

Captioning

46

Challenge: Storytelling

Storytelling

47

Challenge: Movie Description

Movie Description, Retrieval, and Fill-in-the-blank

48

Challenges: Movie Question Answering

Movie Question Answering

49

Challenges: Visual Question Answering

Visual Question Answering

50

VQA challenge accuracy (%, chart maximum 100.0):

Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline (LSTM & CNN): 54.06
I. Masuda-Mora: 53.62
Baseline (Nearest neighbor): 42.85
Baseline (Prior per question type): 37.47
Baseline (All yes): 29.88

I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]

Challenges: Visual Question Answering

51

Summary:

Embedding language and vision into semantic embeddings allows fusion and learning.

Very high interest among researchers. Great topic for your thesis!

Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?

52

Conclusions

New Turing test: How to evaluate AI's image understanding?

Slide credit Issey Masuda

53

Thanks! Q&A. Follow me at:

https://imatge.upc.edu/web/people/xavier-giro

@DocXavi / ProfessorXavi

Page 21: Language and Vision Day 4 Lecture 3

21

Decoder

Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989

and finally normalize to word probabilities with a softmax

Score for word k

Probability that the ith word is word k

Previous words Hidden state

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 22: Language and Vision Day 4 Lecture 3

22

Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted

EOS

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 23: Language and Vision Day 4 Lecture 3

23

Encoder-Decoder

Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 24: Language and Vision Day 4 Lecture 3

24

Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate

Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 25: Language and Vision Day 4 Lecture 3

25

Encoder-Decoder Seq2Seq

Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014

26

Encoder-Decoder Beyond text

27

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

28

Captioning DeepImageSent

(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015

only takes into accountimage features in the firsthidden state

Multimodal Recurrent Neural Network

29

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

30

Captioning Show amp Tell

Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015

31

Captioning LSTM for image amp video

Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code

32

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

Captioning (+ Detection) DenseCap

33

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

34

Captioning (+ Detection) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo

AMAIArdquoa woman wearing a black shirtrdquo ldquo

BOTH ldquotwo men wearing black glassesrdquo

35

Captioning (+ Retrieval) DenseCap

Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016

36

Captioning HRNE

( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016

LSTM unit (2nd layer)

Time

Image

t = 1 t = T

hidden stateat t = T

first chunkof data

37

Visual Question Answering

[z1 z2 hellip zN] [y1 y2 hellip yM]

ldquoIs economic growth decreasing rdquo

ldquoYesrdquo

EncodeEncode

Decode

38

Extract visual features

Embedding

Predict answerMerge

Question

What object is flying

AnswerKite

Visual Question Answering

Slide credit Issey Masuda

39

Visual Question Answering

Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016

Dynamic Parameter Prediction Network (DPPnet)

40

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)

41

Visual Question Answering Dynamic

(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016

Main idea split image into local regions Consider each region equivalent to a sentence

Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors

Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space

42

Visual Question Answering Grounded

(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016

43

Datasets Visual Genome

Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)

44

Datasets Microsoft SIND

Microsoft SIND

45

Challenge Microsoft Coco

Captioning

46

Challenge Storytelling

Storytelling

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi


Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 47: Language and Vision Day 4 Lecture 3

47

Challenge Movie Description

Movie Description Retrieval and Fill-in-the-blank

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 48: Language and Vision Day 4 Lecture 3

48

Challenges Movie Question Answering

Movie Question Answering

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 49: Language and Vision Day 4 Lecture 3

49

Challenges Visual Question Answering

Visual Question Answering

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 50: Language and Vision Day 4 Lecture 3

50

1000

Humans

8330

UC Berkeley amp Sony

6647

Baseline LSTMampCNN

5406

Baseline Nearest neighbor

4285

Baseline Prior per question type

3747

Baseline All yes

2988

5362

I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]

Challenges Visual Question Answering

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 51: Language and Vision Day 4 Lecture 3

51

Summary Embedding language and vision into semantic embeddings

allows fusion learning

Very high interest among researchers Great topic for your

thesis

Will vision and language (and multimedia) communities be

merged with (absorbed by) the machine learning one

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 52: Language and Vision Day 4 Lecture 3

52

Conclusions

New Turing test How to evaluate AIrsquos image understanding

Slide credit Issey Masuda

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi

Page 53: Language and Vision Day 4 Lecture 3

53

Thanks QampA Follow me at

httpsimatgeupceduwebpeoplexavier-giro

DocXaviProfessorXavi