content based visual question answer -...

22
Content Based Visual Question Answer Badri Patro, Vinay Verma & Aatanu Samata Outline Introduction Problem Statement Literature Survey Approach Methods VQA Components Object Proposal Image CNN Sentence CNN Multimodal Convolution VQA Performance Thank You References Content Based Visual Question Answer Badri Patro, Vinay Verma & Aatanu Samata Group 30 :Project Proposal Presentation CS676A:Computer Vision and Image Processing IIT Kanpur March 12, 2016 VQA Computer Vision & Image Processing March 12, 2016 1 / 16

Upload: others

Post on 30-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Content Based Visual Question Answer

Badri Patro, Vinay Verma & Aatanu Samata

Group 30 :Project Proposal PresentationCS676A:Computer Vision and Image Processing

IIT Kanpur

March 12, 2016

VQA Computer Vision & Image Processing March 12, 2016 1 / 16

Page 2: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Outline

1 IntroductionProblem StatementLiterature Survey

2 ApproachMethodsVQA Components

Object ProposalImage CNNSentence CNNMultimodal Convolution

VQA Performance

3 References

VQA Computer Vision & Image Processing March 12, 2016 2 / 16

Page 3: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Introduction

Type of Question

Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).

Object detection(e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).

Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..

VQA Computer Vision & Image Processing March 12, 2016 3 / 16

Page 4: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Introduction

Type of Question

Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).

Object detection(e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).

Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..

VQA Computer Vision & Image Processing March 12, 2016 3 / 16

Page 5: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Introduction

Type of Question

Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).

Object detection(e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).

Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..

VQA Computer Vision & Image Processing March 12, 2016 3 / 16

Page 6: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Introduction

Type of Question

Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).

Object detection(e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).

Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..

VQA Computer Vision & Image Processing March 12, 2016 3 / 16

Page 7: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Introduction

Type of Question

Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).

Object detection(e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).

Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..

VQA Computer Vision & Image Processing March 12, 2016 3 / 16

Page 8: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Visual Question Answering:

Problem Statement: Given an image, and content basedquestion, find the correct answer and confidence of theanswer.

Training on a set of triplets (image, question, answer).

Free-form and open-ended(*) questions.

Answers can be single word or multiple word,depending on dataset.

VQA Computer Vision & Image Processing March 12, 2016 4 / 16

Page 9: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Problem Statement

Visual Question Answering:

Problem Statement: Given an image, and content basedquestion, find the correct answer and confidence of theanswer.

Training on a set of triplets (image, question, answer).

Free-form and open-ended(*) questions.

Answers can be single word or multiple word,depending on dataset.

VQA Computer Vision & Image Processing March 12, 2016 4 / 16

Page 10: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Literature Survey

1 Neural Image-QA(Malinowski et al.,2015)[4]

2 VSE model- VIS LSTM(Ren et al.,May,2015)[3]

VQA Computer Vision & Image Processing March 12, 2016 5 / 16

Page 11: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Introduction Literature Survey

3 mQA(Gao et al.,Nov, 2015)[2]

4 3CNN(Ma et al.,Nov,2015)[1]

VQA Computer Vision & Image Processing March 12, 2016 6 / 16

Page 12: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach Methods

1 Modified CNN,[Ma et al.]

2 Modified CNN with QuestionLSTM,[Ma et al.]

VQA Computer Vision & Image Processing March 12, 2016 7 / 16

Page 13: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach Methods

3 Modified CNN with AnswerLSTM,[Geo et al.]

4 Final Approach,[Ma et al.] & [Geo et al.]

VQA Computer Vision & Image Processing March 12, 2016 8 / 16

Page 14: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Components

Component of Proposed Algorithm

Object Proposal:Find Interest Object

Image CNN:to extract the visual representation

Question LSTM : extract the question representation

Sentence CNN: question representation

Multimodal Convolution: a fusing component tocombine the information from the first threecomponents and generate the answer.

Answer LSTM:for storing the linguistic context in ananswer

VQA Computer Vision & Image Processing March 12, 2016 9 / 16

Page 15: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Components

1 Object Proposal[Cheng et al. [?]and Endres etal. [6]]

Figure: The left column shows the 3 highest ranked proposals,The center column shows the highest ranked proposal with 50%overlap for each object. The right column shows the same for a75% threshold

VQA Computer Vision & Image Processing March 12, 2016 10 / 16

Page 16: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Components

2 Image CNN

vim = σ(wim(CNNim(I )) + bim)

σ: Nonlinear activation function.wim|dX4096 : Mapping matrixCNNimtakes image as input and outputs a fixed lengthvector.bim constant

3 LSTMLSTM layer stores the context information in its memorycells and serves as the bridge among the words in asequence (e.g. a question).It has three gate :

Input gate and Output gate : Regulate the read and writeaccess to the LSTM memory cells.Forget gate: Resets the memory cells when their contentsare out of date.

VQA Computer Vision & Image Processing March 12, 2016 11 / 16

Page 17: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Components

The columns environment

2 Sentence CNN

For sequential input σ, convolutionunit for feature map of type f on thel th layer is

v i(l,f )=

def σ(w(l,f )~vi(l1) + b(l,f ))

~v i(l−1) =

def v i(l−1)||v

i+1(l−1)||v

i+1(l−1)

~v i(0) =

def v iwd ||v i+1

wd ||v i+1wd

Max-pooling after each convolution

v i(l+1,f ) = max(v2i

(l,f ), v2i+1(l,f ) )

(a) High Level.

(b)DetailedLevel.

VQA Computer Vision & Image Processing March 12, 2016 12 / 16

Page 18: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Components

2 Multimodal Convolution

input: vqt = [v0(6)...v

n(6)]

Capturing the interaction between two multimodal inputs

~v i6 = v i

6||vim||v i+16 (1)

v i(mm,f ) = σ(w(mm,f )~v

i6 + b(mm,f )) (2)

(a) High Level.(b) DetailedLevel.

Figure: Multimodal Convolution[1].

VQA Computer Vision & Image Processing March 12, 2016 13 / 16

Page 19: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Performance

2 VQA performances on DAQUAR-Reduced[1]

(a) DAQUAR-Reduced. (b) COCO-QA.

Figure: VQA performances on DAQUAR-Reduced[1].

VQA Computer Vision & Image Processing March 12, 2016 14 / 16

Page 20: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Approach VQA Performance

VQA Demo[1]

VQA Computer Vision & Image Processing March 12, 2016 15 / 16

Page 21: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Thank You

Thank you

VQA Computer Vision & Image Processing March 12, 2016 16 / 16

Page 22: Content Based Visual Question Answer - IITKhome.iitk.ac.in/~badri/badripatro/project/CS676A_Project_Proposal.pdf · Content Based Visual Question Answer Badri Patro, Vinay Verma &

ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

References

[1] L. Ma, Z. Lu, and H. Li., ‘‘Learning to Answer Questions From Imageusing Convolutional Neural Network”,CoRR abs/1506.00333, Nov,2015.

[2] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang and W. Xu.,‘‘Are youtalking to a machine? dataset and methods for multilingual imagequestion answering.”,arXiv 1505.05612v3, Nov, 2015.

[3] M. Ren, R. Kiros, and R. S. Zemel, ‘‘Exploring models and data forimage question answering”,arXiv 1505.02074, , 2015.

[4] M. Malinowski, M. Rohrbach, and M. Fritz.,‘‘Ask your neurons: Aneural-based approach to answering questions about images.”,arXiv1505.01121, Nov, 2015.

[5] A. Karpathy and L. Fei-Fei.,‘‘Deep Visual-Semantic Alignments forGenerating Image Descriptions.”,In CVPR, 2015.

[6] I. Endres, and D. Hoiem.,‘‘Category-Independent Object ProposalsWith Diverse Ranking.”, http://vision.cs.uiuc.edu/proposals/,PAMIFebruary, 2014.

[7] Cheng, Ming-Ming, et al.,‘‘BING: Binarized normed gradients forobjectness estimation at 300fps.”, Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition. 2014.

VQA Computer Vision & Image Processing March 12, 2016 16 / 16