Self-Critical Reasoning for Robust Visual Question Answering · 2020. 1. 14. · Jialin Wu and Raymond J. Mooney


Page 1

Self-Critical Reasoning for Robust Visual Question Answering

Jialin Wu and Raymond J. Mooney

Page 2

Visual Question Answering (VQA)

• Common VQA system

[Figure: the question “What utensil is pictured?” and the original image, encoded as a visual feature set 𝒱, are passed to answer prediction, which scores Knife (0.72) and Fork (0.66).]

Page 3

Capture superficial statistical correlations between QA pairs

VQA system: “I won’t bother to look at the image; I can answer your question just by looking at the question.”

Question: What utensil is pictured? Answer: Knife

[Figure: the original image, and a bar chart titled “Training Answer Distribution” comparing knife and fork on a 0 to 100 axis.]

Page 4

Force VQA to focus on what humans focus on

• Extract a proposal set of objects ( ) that humans focus on.

[Figure: the proposal object set is obtained from a human visual explanation OR a human textual explanation, e.g. “There is a fork near the cake.”]

Page 5

Force VQA to focus on what humans focus on

• Enforce the gradients for the correct answer to have the largest value for at least one of the extracted objects.

[Figure: the gradient ∇_𝒱 p(fork | Q, 𝒱) is checked against the proposal object set, giving the Influence Strengthen Loss.]
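The constraint on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `sensitivity[i]` is assumed to be a scalar summary of the gradient of p(correct answer | Q, 𝒱) with respect to object i, and the hinge-style penalty and the function name `influence_strengthen_loss` are hypothetical choices made for the example.

```python
import numpy as np

def influence_strengthen_loss(sensitivity, proposal_idx):
    """Hinge-style penalty (an assumed form): zero when at least one
    proposal object has a sensitivity no smaller than every
    non-proposal object."""
    sensitivity = np.asarray(sensitivity, dtype=float)
    proposal_idx = list(proposal_idx)

    # Boolean mask marking which objects belong to the proposal set.
    in_proposal = np.zeros(sensitivity.shape[0], dtype=bool)
    in_proposal[proposal_idx] = True
    others = sensitivity[~in_proposal]

    # For each proposal object, add up the margins by which
    # non-proposal objects exceed it.
    per_proposal = [
        np.maximum(0.0, others - sensitivity[i]).sum() for i in proposal_idx
    ]

    # Only one proposal object has to win, so take the minimum.
    return float(min(per_proposal))
```

Taking the minimum over the proposal set reflects the "at least one of the extracted objects" wording: the penalty vanishes as soon as a single human-attended object dominates.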

Page 6

Results

• Compared to the baseline model on the VQA-CP dataset
• VQA-CP deliberately splits the data so that the train and test sets have very different distributions

[Figure: bar chart of VQA scores (All), axis from 38 to 53, comparing Baseline and Ours (infl).]

Page 7

Oversensitivity to the most common objects

VQA system: “I can focus on the fork, but I still think it is a knife.”

Question: What utensil is pictured? Answer: Knife

[Figure: focused objects for answer “fork” and focused objects for answer “knife”.]

Page 8

Criticizing the false influential object

• Find the most influential object for the correct answer using gradients

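[Figure: for the question “What utensil is pictured?”, the original image is encoded as the visual feature set 𝒱 and answer prediction scores Knife (0.72) and Fork (0.66). The gradient ∇_𝒱 p(fork | Q, 𝒱), explaining prediction “fork”, is combined with the proposal object set (from a human visual explanation OR a human textual explanation such as “There is a fork near the cake.”) to identify the most influential object.]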
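A rough sketch of gradient-based selection of the most influential object follows. Everything in it is an assumption made for illustration: `toy_answer_prob` is a hypothetical stand-in for p(answer | Q, 𝒱), the gradient is approximated by central finite differences rather than backpropagation, and summing the gradient over an object's feature dimensions is just one possible way to get a per-object score.

```python
import numpy as np

def toy_answer_prob(features, weights):
    """Hypothetical stand-in for p(answer | Q, V): a sigmoid of a
    weighted sum over all object features."""
    return 1.0 / (1.0 + np.exp(-np.sum(features * weights)))

def object_sensitivity(prob_fn, features, eps=1e-5):
    """Approximate the gradient of prob_fn w.r.t. every feature by
    central finite differences, then sum it per object (per row)."""
    grad = np.zeros_like(features, dtype=float)
    for idx in np.ndindex(features.shape):
        step = np.zeros_like(features, dtype=float)
        step[idx] = eps
        grad[idx] = (prob_fn(features + step) - prob_fn(features - step)) / (2 * eps)
    return grad.sum(axis=1)

def most_influential(sensitivity, proposal_idx):
    """Index of the proposal object with the largest sensitivity."""
    proposal_idx = list(proposal_idx)
    return proposal_idx[int(np.argmax(sensitivity[proposal_idx]))]
```

Here `features` is expected to be a 2-D array with one row per detected object. In a real model the finite-difference loop would normally be replaced by automatic differentiation.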

Page 9

Criticizing the false influential object

• Force the object to contribute more to the correct answer.

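[Figure: for the question “What utensil is pictured?”, the original image is encoded as the visual feature set 𝒱 and answer prediction scores Knife (0.72) and Fork (0.66). The most influential object is identified from ∇_𝒱 p(fork | Q, 𝒱), explaining prediction “fork”, together with the proposal object set (from a human visual explanation OR a human textual explanation such as “There is a fork near the cake.”). Its influence is then compared with ∇_𝒱 p(knife | Q, 𝒱), explaining prediction “knife”, to form the Self Critical Loss.]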
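One way to read the self-critical idea is as a margin penalty on the most influential object: it should not contribute more to a wrong answer such as “knife” than to the correct answer “fork”. The sketch below assumes per-object sensitivity scores are already available; the hinge form, the support for several wrong answers at once, and the name `self_critical_loss` are all hypothetical choices for the example rather than the authors' definition.

```python
import numpy as np

def self_critical_loss(sens_correct, sens_wrong, proposal_idx):
    """Penalize wrong answers whose sensitivity to the most influential
    object exceeds that of the correct answer (assumed hinge form).

    sens_correct: per-object sensitivities for the correct answer.
    sens_wrong:   per-object sensitivities for one wrong answer, or a
                  2-D array with one row per wrong answer.
    """
    sens_correct = np.asarray(sens_correct, dtype=float)
    sens_wrong = np.atleast_2d(np.asarray(sens_wrong, dtype=float))
    proposal_idx = list(proposal_idx)

    # Most influential proposal object for the correct answer.
    star = proposal_idx[int(np.argmax(sens_correct[proposal_idx]))]

    # Margin by which each wrong answer relies on that object more
    # than the correct answer does.
    margins = np.maximum(0.0, sens_wrong[:, star] - sens_correct[star])
    return float(margins.sum())
```

With sensitivities `[0.1, 0.4, 0.2]` for “fork” and `[0.3, 0.7, 0.1]` for “knife” and proposal objects 1 and 2, object 1 is selected and the penalty is the gap between 0.7 and 0.4; minimizing it pushes that object to contribute more to the correct answer.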

Page 10

Our self-critical approach

VQA system: “Oh, yes, the utensil should be a fork.”

Question: What utensil is pictured? Answer: Fork

Page 11

Results

• Compared to the baseline model on the VQA-CP dataset

[Figure: bar chart of VQA scores (All), axis from 38 to 52, comparing Baseline, Ours (infl), and Ours (infl + crit).]