Vision and Language Representation Learning – Self-Supervised Pretraining and Multi-Task Learning

Jiasen Lu

April 21, 2020


Page 1: Vision and Language Representation Learning (valser.org/webinar/slide/slides/20200422/Valse_jiasen..., 2020-04-23)

Vision and Language Representation Learning

– Self Supervised Pretraining and Multi-Task Learning

1 1

Jiasen Lu

April 21, 2020

Page 2:

Vision and Language tasks: Image Captioning, Visual Question Answering, Visual Commonsense Reasoning, Refer Expression.

[Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

Page 4:

Visual Grounding

C: A bunch of red and yellow flowers on a branch.

Q: What type of plant is this?

A: Banana

[Shen et al. 2018]

Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.

Page 5:

Pretrain-Transfer

Object Detection Semantic Segmentation Pose Estimation

Question Answering Commonsense Inference Sentiment Analysis

[Deng et al. 2009, Devlin et al. 2018]

Page 6:

Pretrain-Transfer

Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.

Conceptual Captions: pop artist performs at the festival in a city.

Conceptual Caption Dataset

• Aligned image-caption pairs.

• 3.3 million images, compared to 0.12 million in COCO Captions.

• Automatically collected.

[Sharma et al. 2018]

Page 7:

BERT

[Figure: BERT encodes a token sequence <CLS> Sentence A <SEP> Sentence B <SEP> into contextual hidden states h_0 … h_T.]

[Sharma et al. 2018, Devlin et al. 2018]

Page 8:

ViLBERT

[Figure: the same BERT encoder with <MASK>ed input tokens, trained to predict the original tokens T1, T2.]

[Sharma et al. 2018, Devlin et al. 2018]

Page 9:

Single-Stream model

[Figure: a single-stream model concatenates image regions and sentence tokens into one BERT input, <CLS> Image <SEP> Sentence <SEP>, producing hidden states h_0 … h_T.]

[Sharma et al. 2018, Devlin et al. 2018]

Page 11:

ViLBERT

Problem: different modalities may require different levels of abstraction.

• Linguistic stream: shallow processing, e.g. a token embedding ("artist") followed by a linear layer.

• Visual stream: deep convolutional features [He et al. 2015].

Page 12:

ViLBERT

Solution: a two-stream model that processes the visual and linguistic inputs separately.

[Figure: the linguistic stream (L-BERT, l layers) encodes <CLS> Tok1 Tok2 … <SEP>; the visual stream (V-BERT, k layers) encodes <IMG> plus image regions.]

Page 13:

ViLBERT

Problem: how to fuse the two modalities?

Solution: use co-attention [Lu et al. 2016] to fuse information between the different sources.
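The co-attention idea can be sketched as follows: each stream computes attention queries from itself but keys and values from the other stream. This is a minimal single-head NumPy sketch, not the actual ViLBERT layer (which is a multi-head transformer block with separate learned projections per stream); the projection matrices here are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(lang, vis, d_k=64, seed=0):
    """Single-head co-attention: queries from one stream,
    keys/values from the other stream."""
    rng = np.random.default_rng(seed)
    d = lang.shape[-1]
    # Random stand-ins for the learned projection matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))

    def attend(q_src, kv_src):
        Q, K, V = q_src @ Wq, kv_src @ Wk, kv_src @ Wv
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    lang_out = attend(lang, vis)  # words attend over image regions
    vis_out = attend(vis, lang)   # image regions attend over words
    return lang_out, vis_out

lang = np.random.default_rng(1).standard_normal((5, 128))  # 5 word tokens
vis = np.random.default_rng(2).standard_normal((8, 128))   # 8 image regions
l_out, v_out = co_attention(lang, vis)
print(l_out.shape, v_out.shape)  # (5, 64) (8, 64)
```

Swapping the key/value source is the only change relative to standard self-attention, which is what lets each stream condition on the other modality.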

Page 15:

Pre-training Objectives

Masked multi-modal modelling

• Follows the masked LM objective in BERT.

• Randomly select 15% of the words / image regions to predict.

• Linguistic stream:

o 80% of the time, replace with [MASK].

o 10% of the time, replace with a random word.

o 10% of the time, keep the word unchanged.

• Visual stream:

o 80% of the time, replace the region feature with a zero vector.

Multi-modal alignment prediction

• Predict whether the image and caption are aligned.
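The masking scheme above can be sketched as below. The helper names (`mask_tokens`, `mask_regions`) are illustrative assumptions; a real implementation operates on token ids and detector features rather than strings.

```python
import random
import numpy as np

def mask_tokens(tokens, vocab, p=0.15, seed=0):
    """BERT-style masking for the linguistic stream: of the selected
    words, 80% -> [MASK], 10% -> a random word, 10% -> unchanged."""
    rng = random.Random(seed)
    out, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the word unchanged
    return out, targets

def mask_regions(features, p=0.15, seed=0):
    """Visual stream: 80% of the selected region features are zeroed;
    the model then has to predict the masked regions' semantics."""
    rng = random.Random(seed)
    feats = features.copy()
    masked = []
    for i in range(len(feats)):
        if rng.random() < p:
            masked.append(i)
            if rng.random() < 0.8:
                feats[i] = 0.0        # replace with a zero vector
    return feats, masked

tokens = "a boat covered in flowers near the market".split()
masked_tokens, targets = mask_tokens(tokens, vocab=tokens)
masked_feats, masked_idx = mask_regions(np.ones((10, 2048)))
```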

Page 16:

Visualizations

A boat covered in flowers near the market.

[Sharma et al. 2018]

Page 17:

Sentence → Image

[Figure: sentence→image attention maps across layers 0–5 and heads H0–H7.]

Visualized with BertViz: https://github.com/jessevig/bertviz

Page 19:

Image → Sentence

[Figure: image→sentence attention maps across layers 0–5 and heads H0–H7.]

Visualized with BertViz: https://github.com/jessevig/bertviz

Page 21:

Pre-training and Fine-Tuning

[Figure, pre-training: an image-text pair from Conceptual Captions is fed to the Vision & Language BERT with a masked sentence ("<CLS> Man shopping for … <SEP>" with <MASK>ed words) and masked image regions; the model reconstructs the masked words and regions.]

[Figure, fine-tuning: the same Vision & Language BERT takes an image-question pair ("<CLS> What is the … <SEP>") and is fine-tuned for VQA, VCR, and referring expressions.]

Page 22:

Tasks

[Antol 2015, Zellers 2018, Yu 2016, Plummer 2015]

Page 23:

Results

[Figure: bar charts comparing prior methods and ViLBERT. VQA test-dev: 65.9–70.55; VCR Q→A val: 43.1–54.04; RefCOCO+ val: 65.33–72.34; Image Retrieval test: 45.5–58.2.]

Page 24:

Concurrent Work

[Li 2019, Tan 2019, Li 2019, Su 2019, Zhou 2019, Chen 2019]

Page 25:

Summary

Task-agnostic visiolinguistic representation pretraining for visual grounding:

• Introduced pretrain-transfer for vision-and-language tasks.

• Achieved SOTA on multiple vision-and-language tasks.

Limitations

The model can still learn inconsistent grounding through task-specific fine-tuning.

• Next step: training multiple vision-and-language tasks together (multi-task V&L).

Page 26:

Multi-Task V&L Learning

One model for V&L: ViLBERT. Problems:

• Inconsistent grounding from task-specific fine-tuning.

• Only four V&L tasks.

• The model is huge and prone to overfitting.

What we want:

• Test on more tasks.

• Consistent grounding across tasks.

• Explore the limits of the model.

Task groups:

• VQA: VQA, Genome QA, GQA

• Image Description: caption-based retrieval (COCO), caption-based retrieval (Flickr)

• Referring Expression: RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GuessWhat

• V&L Verification: NLVR2, Visual Entailment

Page 28:

Multi-Task V&L Learning

Model improvements over ViLBERT

• Masked multi-modal modelling only for aligned image-caption pairs.

[Figure: the two-stream L-BERT / V-BERT with a <MASK>ed token in an aligned caption.]

Page 29:

Multi-Task V&L Learning

Model improvements over ViLBERT

• Masked multi-modal modelling only for aligned image-caption pairs.

• Masking overlapped image regions (IoU > 0.4).

[Figure: overlapping image regions are masked together in the visual stream.]
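The overlapped-region masking can be sketched as below. The `regions_to_mask` helper and the (x1, y1, x2, y2) box format are assumptions for illustration: the point is that when a region is masked, near-duplicate boxes are masked with it.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def regions_to_mask(boxes, target_idx, thresh=0.4):
    """When a region is chosen for masking, also mask every region that
    overlaps it with IoU > thresh, so the model cannot trivially recover
    the masked feature from a near-duplicate box."""
    t = boxes[target_idx]
    return [i for i, b in enumerate(boxes)
            if i == target_idx or iou(t, b) > thresh]

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(regions_to_mask(boxes, 0))  # boxes 0 and 1 overlap heavily: [0, 1]
```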

Page 31:

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

• Use a different head per task, but similar tasks share the same head.

[Figure: the shared two-stream trunk with task heads for VQA / Genome QA, GQA, Retrieval, NLVR2, Visual Entailment, and Refer Expression.]
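The head-sharing scheme can be sketched as a simple routing table. The task and head names here are hypothetical labels, not identifiers from the actual code; each head would be a learned module rather than a lambda.

```python
# Map each task to its output head; similar tasks share one head.
TASK_TO_HEAD = {
    "VQA": "vocab_classifier",        # VQA and Genome QA share a head
    "GenomeQA": "vocab_classifier",
    "GQA": "gqa_classifier",
    "Retrieval": "alignment_score",   # retrieval / NLVR2 / entailment
    "NLVR2": "alignment_score",
    "VisualEntailment": "alignment_score",
    "ReferExpression": "region_scorer",
}

def forward(task, features, heads):
    """Route the shared trunk's features to the task group's head."""
    return heads[TASK_TO_HEAD[task]](features)

# Stand-in heads; in practice these are learned output layers.
heads = {
    "vocab_classifier": lambda h: ("answer-logits", len(h)),
    "alignment_score": lambda h: ("alignment-logit", 1),
    "region_scorer": lambda h: ("region-scores", len(h)),
}
print(forward("GenomeQA", [0.1, 0.2], heads))  # ('answer-logits', 2)
```

Sharing a head across similar tasks both reduces parameters and pushes those tasks toward consistent grounding.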

Page 32:

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

• Use a different head per task, but similar tasks share the same head.

• Add a <TSK> token for multi-task training.

[Figure: the language input now begins with a <TSK> task token alongside <CLS>.]

Page 33:

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

• Use a different head per task, but similar tasks share the same head.

• Add a <TSK> token for multi-task training.

• Dynamic Stop and Go.

Page 35:

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

• Use a different head per task, but similar tasks share the same head.

• Add a <TSK> token for multi-task training.

• Dynamic Stop and Go.

Training Procedure

• Conceptual Captions pre-training.

• Multi-task training.

• Fine-tune on each single task.

Page 36:

Compare with SOTA

[Figure: bar charts comparing pretrained models (1-stream and 2-stream; pretrained on CC, CC+wiki, CC+COCO, COCO+VG, CC+SBU). VQA test-dev: 70.55–73.08; RefCOCO+ test: 69.36–74.12.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

Page 45:

Ablation Study

[Figure: ablation bar charts across ten tasks. Score ranges across ablations: VQA test-dev 71.24–73.08; Genome QA test 34.1–36.38; GQA test 59.09–60.72; Retrieval COCO R1 63.02–67.36; Retrieval Flickr R1 61.46–66.12; Visual7W test 80.51–83.06; GuessWhat test 62.53–65.19; NLVR2 test 74.25–78.24; SNLI-VE test 76.52–77.32; RefCOCO+ test 69.47–74.12.]

Page 48:

Ablation Study

Task performance with different groups (relative performance of each column's group when trained together with the row's group):

                      Relative performance of
Trained with          VQA     Image Retrieval   Refer Expression   V&L Verification   Average
VQA                    -      0.38              0.38               -0.2               0.19
Image Retrieval       0.46     -                0.23               -4.13              -1.15
Refer Expression      0.39    0.78               -                 0.24               0.47
V&L Verification      2.29    1.47              0.67                -                 1.48
Average               1.04    0.88              0.43               -1.36               -

Page 51:

DEMO

https://vilbert.cloudcv.org/

Page 52:

Summary

Explored multi-task vision-and-language learning.

• Introduced several tricks to improve ViLBERT.

• Added <TSK> tokens to improve multi-task learning.

• Fine-tuning on the multi-task representation leads to new SOTA.

• Studied how the different task groups interact with each other.

Potential directions

• How to use information across tasks for XAI?

• How to incorporate more modalities?

• How to make the model smaller and more efficient?

• How to combine symbolic reasoning with representation learning?

Page 53:

The End

Questions?