TRANSCRIPT — ai-nlp-ml/giancourse/nmt/images-ground-machine.pdf
Using Images to Ground Machine Translation
Iacer Calixto
December 7, 2017
ADAPT Centre, School of Computing, Dublin City University
Dublin, Ireland
[email protected]
Outline
Introduction
NMT and IDG Architectures
Multi-modal MT Shared Task(s)
Our MMT Models
Experiments
Introduction
Introduction [1/2]
• Machine Translation (MT): the task in which we wish to learn a model to translate text from one natural language (e.g., English) into another (e.g., German).
  • text-only task;
  • model is trained on parallel source/target sentence pairs.
• Image description generation (IDG): the task in which we wish to learn a model to describe an image using natural language (e.g., German).
  • multi-modal task (text and vision);
  • model is trained on image/target sentence pairs.
Introduction [2/2]
• Multi-Modal Machine Translation (MMT): learn a model to translate text and an image that illustrates this text from one natural language (e.g., English) into another (e.g., German).
  • multi-modal task (text and vision);
  • model is trained on source/image/target triplets;
  • can be seen as a form of augmented MT or augmented image description generation.
Use Cases
• Multi-Modal Machine Translation (MMT) use-cases:
  • localisation of product information in e-commerce, e.g. eBay, Amazon;
  • localisation of user posts and photos in social networks, e.g. Facebook, Instagram, Twitter;
  • translation of image descriptions in general;
  • translation of subtitles (video), etc.
Convolutional Neural Networks (CNN)
• Virtually all MMT and IDG models use pre-trained CNNs for image feature extraction;
• Illustration of the VGG19 network (Simonyan and Zisserman, 2014):
Figure 1: https://goo.gl/y0So1l
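Two kinds of CNN features recur throughout these slides: location-preserving activations from a convolutional layer (one vector per image region) and a single "bottleneck" vector from a fully-connected layer. A minimal numpy sketch of the distinction, with random arrays standing in for real VGG19 activations (the shapes are typical assumptions, not extracted values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for CNN activations, in the spirit of VGG19:
conv_maps = rng.standard_normal((14, 14, 512))  # location-preserving conv features
fc_vector = rng.standard_normal(4096)           # bottleneck (global) features

# Location-preserving use: flatten the spatial grid into L = 196 annotation
# vectors, one per image region, so an attention mechanism can pick regions.
annotations = conv_maps.reshape(-1, 512)
assert annotations.shape == (196, 512)

# Global use: a single vector summarising the whole image, e.g. for
# initialising an RNN or modulating word embeddings.
assert fc_vector.shape == (4096,)
```

The attention-based models below consume the annotation vectors; the initialisation and embedding-interaction models consume the bottleneck vector.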
Example CNNs
(a) https://goo.gl/jqQEvg
(b) Illustration of a residual connection (He et al., 2015).
NMT and IDG Architectures
Neural Machine Translation
The attention mechanism lets the decoder search for the best source words to generate each target word, e.g. Bahdanau et al., 2015.
Neural Image Description Generation
The attention mechanism lets the decoder look at, or attend to, specific parts of the image when generating each target word, e.g. Xu et al., 2015.
Multi-modal MT Shared Task(s)
Multimodal MT Shared Tasks – overall ideas
• 3 types of submissions:
  • Two attention mechanisms: compute context vectors over the source-language hidden states and location-preserving image features;
  • Encoder and/or decoder initialisation: initialise encoder and/or decoder RNNs with bottleneck image features;
  • Other alternatives:
    • element-wise multiplication of the target-language embeddings with bottleneck image features;
    • sum source-language word embeddings with bottleneck image features;
    • use visual features in a retrieval framework;
    • visually ground encoder representations by learning to predict bottleneck image features from the source-language hidden states.
http://mtm2017.unbabel.com/assets/images/slides/lucia_specia.pdf
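The last alternative, grounding the encoder by predicting image features, can be sketched as an auxiliary regression objective. This is a toy numpy illustration of the general idea (dimensions, the mean-pooled sentence summary, and the MSE loss are assumptions; submissions such as Elliott and Kadar, 2017 use their own formulations, e.g. a margin loss):

```python
import numpy as np

rng = np.random.default_rng(3)
Tx, enc_dim, img_dim = 6, 8, 2048          # toy sizes (assumptions)

h = rng.standard_normal((Tx, enc_dim))     # source-language hidden states
v = rng.standard_normal(img_dim)           # bottleneck image features (target)

W = rng.standard_normal((img_dim, enc_dim)) * 0.01

# Auxiliary grounding objective: predict the image vector from a summary of
# the sentence; its gradient pushes visual information into the encoder,
# and no image is needed at translation time.
v_hat = W @ h.mean(axis=0)
loss = ((v_hat - v) ** 2).mean()
assert loss >= 0.0
```

Because the image is only used in this side objective during training, the translation model itself stays text-only at test time.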
Heidelberg University (Hitschler et al., 2016)
CMU (Huang et al., 2016) [1/3]
CMU (Huang et al., 2016) [2/3]
CMU (Huang et al., 2016) [3/3]
UvA-TiCC (Elliott and Kadar, 2017)
LIUM-CVC (Caglayan et al., 2017)
• Global visual features, i.e. 2048D pool5 features from ResNet-50, are multiplicatively interacted with the target word embeddings;
• With 128D embeddings and 256D recurrent layers, their resulting models have ∼5M parameters.
(Elliott et al., 2017)
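One plausible reading of this multiplicative interaction, sketched in numpy: project the 2048D image vector down to the embedding size, squash it, and rescale each word embedding element-wise. The projection matrix and the tanh squashing are assumptions for illustration; only the 128D/2048D sizes come from the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
emb_dim, img_dim = 128, 2048        # sizes reported on the slide

e = rng.standard_normal(emb_dim)    # a target word embedding
v = rng.standard_normal(img_dim)    # ResNet-50 pool5 features

# Project the global image vector to the embedding size, then modulate
# the word embedding element-wise.
W = rng.standard_normal((emb_dim, img_dim)) * 0.01
e_grounded = e * np.tanh(W @ v)
assert e_grounded.shape == (emb_dim,)
```

Because the image enters only through this cheap element-wise product, the model stays small, consistent with the ∼5M-parameter figure above.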
Our MMT Models
Doubly-Attentive Multi-Modal NMT – NMTSRC+IMG
Figure 3: Doubly-Attentive Multi-modal NMT (Calixto et al., 2017a)
image gating
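The figure's "image gating" idea can be sketched as follows: the decoder receives two context vectors, one from attention over source words and one from attention over image regions, and a scalar gate computed from the decoder state rescales the visual context before it enters the decoder update. All sizes and weights below are toy assumptions, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
dec_dim, enc_dim, img_dim = 8, 8, 512    # toy sizes (assumptions)

s = rng.standard_normal(dec_dim)         # decoder state
c_src = rng.standard_normal(enc_dim)     # context from attention over source words
c_img = rng.standard_normal(img_dim)     # context from attention over image regions

# Image gating: a scalar gate in (0, 1) computed from the decoder state
# decides how much visual context flows into this decoding step.
w_g = rng.standard_normal(dec_dim)
beta = sigmoid(w_g @ s)
c_img_gated = beta * c_img

assert 0.0 < beta < 1.0
assert c_img_gated.shape == (img_dim,)
```

The gate lets the model lean on the image for visually grounded words and ignore it elsewhere, which is the intuition behind the doubly-attentive design.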
Image as source-language words – IMGW
• IMGW – Global visual features are projected into the source-language word embedding space and used as the first/last word in the source sequence.
(Calixto et al., 2017b)
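A minimal numpy sketch of IMGW: project the global image vector into the embedding space and splice it into the source sequence as a pseudo-word. The dimensions and projection matrix are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
emb_dim, img_dim, Tx = 128, 4096, 5     # toy sizes (assumptions)

X = rng.standard_normal((Tx, emb_dim))  # source word embeddings
v = rng.standard_normal(img_dim)        # global image features

# Project the image into the word-embedding space and treat it as an
# extra token at the beginning and/or end of the source sentence.
W_I = rng.standard_normal((emb_dim, img_dim)) * 0.01
img_word = W_I @ v
X_grounded = np.vstack([img_word, X, img_word])   # first and last "word"
assert X_grounded.shape == (Tx + 2, emb_dim)
```

From the encoder's point of view the image is then just another source position, so the standard attention mechanism can attend to it like any word.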
Image for encoder initialisation – IMGE
• IMGE – Global visual features are projected into the source-language RNN hidden-state space and used to compute the initial state of the source-language RNN.
(Calixto et al., 2017b)
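IMGE can be sketched as a single learned projection followed by a non-linearity, producing the encoder's initial state. Sizes, the projection, and the tanh are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
hid_dim, img_dim = 256, 4096            # toy sizes (assumptions)

v = rng.standard_normal(img_dim)        # global image features

# Project the image into the encoder hidden-state space and use the result
# as the initial hidden state of the source-language RNN.
W_init = rng.standard_normal((hid_dim, img_dim)) * 0.01
h0 = np.tanh(W_init @ v)
assert h0.shape == (hid_dim,)
assert np.all(np.abs(h0) <= 1.0)
```

Every encoder state then carries some visual signal, since the RNN unrolls from an image-conditioned start.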
Image for decoder initialisation – IMGD
• IMGD – Global visual features are projected into the target-language RNN hidden-state space and used as additional data to compute the initial state of the target-language RNN. (Calixto et al., 2017b)
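IMGD differs from IMGE only in where the image enters: the projected image vector is combined with the usual source-sentence summary when initialising the decoder. A toy numpy sketch (the mean-pooled summary and both projections are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
hid_dim, img_dim, Tx = 256, 4096, 5     # toy sizes (assumptions)

h = rng.standard_normal((Tx, hid_dim))  # source-language hidden states
v = rng.standard_normal(img_dim)        # global image features

# Decoders are typically initialised from a summary of the source states;
# here the projected image vector enters that computation as extra input.
W_s = rng.standard_normal((hid_dim, hid_dim)) * 0.1
W_v = rng.standard_normal((hid_dim, img_dim)) * 0.01
s0 = np.tanh(W_s @ h.mean(axis=0) + W_v @ v)
assert s0.shape == (hid_dim,)
```

Conditioning the decoder start on the image biases generation toward visually grounded word choices from the very first target word.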
Experiments
English–German [1/2]
• Training data: Multi30k data set (Elliott et al., 2016).
| Model | Training data | BLEU4↑ | METEOR↑ | TER↓ | chrF3↑ |
|---|---|---|---|---|---|
| NMT | M30kT | 33.7 | 52.3 | 46.7 | 65.2 |
| PBSMT | M30kT | 32.9 | 54.3† | 45.1† | 67.4 |
| Huang et al., 2016 | M30kT | 35.1 (↑ 1.4) | 52.2 (↓ 2.1) | — | — |
| + RCNN | M30kT | 36.5 (↑ 2.8) | 54.1 (↓ 0.2) | — | — |
| NMTSRC+IMG | M30kT | 36.5†‡ (↑ 2.8) | 55.0†† (↑ 0.9) | 43.7†‡ (↓ 1.4) | 67.3 (↓ 0.1) |
| IMGW | M30kT | 36.9†‡ (↑ 3.2) | 54.3†‡ (↑ 0.2) | 41.9†‡ (↓ 3.2) | 66.8 (↓ 0.6) |
| IMGE | M30kT | 37.1†‡ (↑ 3.4) | 55.0†‡ (↑ 0.9) | 43.1†‡ (↓ 2.0) | 67.6 (↑ 0.2) |
| IMGD | M30kT | 37.3†‡ (↑ 3.6) | 55.1†‡ (↑ 1.0) | 42.8†‡ (↓ 2.3) | 67.7 (↑ 0.3) |
English–German [2/2]
• Pre-training on back-translated comparable Multi30k data set (Elliott et al., 2016).
| Model | Training data | BLEU4↑ | METEOR↑ | TER↓ | chrF3↑ |
|---|---|---|---|---|---|
| PBSMT (LM) | M30kT | 34.0 | 55.0† | 44.7 | 68.0 |
| NMT | M30kT | 35.5‡ | 53.4 | 43.3‡ | 65.2 |
| NMTSRC+IMG | M30kT | 37.1†‡ (↑ 1.6) | 54.5†‡ (↓ 0.5) | 42.8†‡ (↓ 0.5) | 66.6 (↓ 1.4) |
| IMGW | M30kT | 36.7†‡ (↑ 1.2) | 54.6†‡ (↓ 0.4) | 42.0†‡ (↓ 1.3) | 66.8 (↓ 1.2) |
| IMGE | M30kT | 38.5†‡ (↑ 3.0) | 55.7†‡ (↑ 0.9) | 41.4†‡ (↓ 1.9) | 68.3 (↑ 0.3) |
| IMGD | M30kT | 38.5†‡ (↑ 3.0) | 55.9†‡ (↑ 1.1) | 41.6†‡ (↓ 1.7) | 68.4 (↑ 0.4) |
German–English [1/2]
• Training data: Multi30k data set (Elliott et al., 2016).
| Model | BLEU4↑ | METEOR↑ | TER↓ | chrF3↑ |
|---|---|---|---|---|
| PBSMT | 32.8 | 34.8 | 43.9 | 61.8 |
| NMT | 38.2 | 35.8 | 40.2 | 62.8 |
| NMTSRC+IMG | 40.6†‡ (↑ 2.4) | 37.5†‡ (↑ 1.7) | 37.7†‡ (↓ 2.5) | 65.2 (↑ 2.4) |
| IMGW | 39.5†‡ (↑ 1.3) | 37.1†‡ (↑ 1.3) | 37.1†‡ (↓ 3.1) | 63.8 (↑ 1.0) |
| IMGE | 41.1†‡ (↑ 2.9) | 37.7†‡ (↑ 1.9) | 37.9†‡ (↓ 2.3) | 65.7 (↑ 2.9) |
| IMGD | 41.3†‡ (↑ 3.1) | 37.8†‡ (↑ 2.0) | 37.9†‡ (↓ 2.3) | 65.7 (↑ 2.9) |
German–English [2/2]
• Pre-training on back-translated comparable Multi30k data set (Elliott et al., 2016).
| Model | BLEU4↑ | METEOR↑ | TER↓ | chrF3↑ |
|---|---|---|---|---|
| PBSMT | 36.8 | 36.4 | 40.8 | 64.5 |
| NMT | 42.6 | 38.9 | 36.1 | 67.6 |
| NMTSRC+IMG | 43.2†‡ (↑ 0.6) | 39.0†‡ (↑ 0.1) | 35.5†‡ (↓ 0.6) | 67.7 (↑ 0.1) |
| IMGW | 42.4†‡ (↓ 0.2) | 39.0†‡ (↑ 0.1) | 34.7†‡ (↓ 1.4) | 67.6 (↑ 0.0) |
| IMGE | 43.9†‡ (↑ 1.3) | 39.7†‡ (↑ 0.8) | 34.8†‡ (↓ 1.3) | 68.6 (↑ 1.0) |
| IMGD | 43.4†‡ (↑ 0.8) | 39.3†‡ (↑ 0.4) | 35.2†‡ (↓ 0.9) | 67.8 (↑ 0.2) |
NMTSRC+IMG — Visualisation of attention states
(a) Image–target word alignments. (b) Source–target word alignments.
References I
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, ICLR 2015.
Caglayan, O., Aransa, W., Bardet, A., García-Martínez, M., Bougares, F., Barrault, L., Masana, M., Herranz, L., and van de Weijer, J. (2017). LIUM-CVC Submissions for WMT17 Multimodal Translation Task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 432–439.
Calixto, I., Liu, Q., and Campbell, N. (2017a). Doubly-Attentive Decoder for Multi-modal Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1913–1924, Vancouver, Canada.
Calixto, I. and Liu, Q. (2017b). Incorporating Global Visual Features into Attention-based Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1003–1014, Copenhagen, Denmark.
Elliott, D., Frank, S., Sima'an, K., and Specia, L. (2016). Multi30K: Multilingual English-German Image Descriptions. In Proceedings of the 5th Workshop on Vision and Language, VL@ACL 2016, Berlin, Germany.
Elliott, D., Frank, S., Barrault, L., Bougares, F., and Specia, L. (2017). Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark.
Elliott, D. and Kádár, Á. (2017). Imagination Improves Multimodal Translation. arXiv preprint arXiv:1705.04350.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.
Hitschler, J., Schamoni, S., and Riezler, S. (2016). Multimodal Pivots for Image Caption Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2399–2409, Berlin, Germany.
Huang, P.-Y., Liu, F., Shiang, S.-R., Oh, J., and Dyer, C. (2016). Attention-based Multimodal Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 639–645, Berlin, Germany.
Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2048–2057. JMLR Workshop and Conference Proceedings.
Thank you! Questions?