ContentBased Visual

QuestionAnswer

Badri Patro,Vinay Verma& Aatanu

Samata

Outline

Introduction

ProblemStatement

LiteratureSurvey

Approach

Methods

VQAComponents

ObjectProposal

Image CNN

Sentence CNN

MultimodalConvolution

VQAPerformance

Thank You

References

Content Based Visual Question Answer

Badri Patro, Vinay Verma & Aatanu Samata

Group 30: Project Proposal Presentation

CS676A: Computer Vision and Image Processing

IIT Kanpur

March 12, 2016

VQA Computer Vision & Image Processing March 12, 2016 1 / 16


Outline

1 Introduction
    Problem Statement
    Literature Survey

2 Approach
    Methods
    VQA Components
        Object Proposal
        Image CNN
        Sentence CNN
        Multimodal Convolution
    VQA Performance

3 References



Introduction Problem Statement

Introduction

Types of Questions

Fine-grained recognition (e.g., What kind of cheese is on the pizza?).

Object detection (e.g., How many bikes are there?).

Activity recognition (e.g., Is this man crying?).

Knowledge base reasoning (e.g., Why did Katappa kill Bahubali?).

Commonsense reasoning (e.g., Does this person have 20/20 vision? Is this person expecting company?).



Introduction Problem Statement

Visual Question Answering:

Problem Statement: Given an image and a content-based question about it, find the correct answer and the confidence of that answer.

Training is done on a set of triplets (image, question, answer).

Questions are free-form and open-ended (*).

Answers can be a single word or multiple words, depending on the dataset.
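The "answer and confidence" output can be illustrated with a minimal sketch. It assumes the single-word setting, where answering is treated as classification over a fixed answer vocabulary and the confidence is the softmax probability of the chosen answer. The vocabulary, the scores, and the function names are hypothetical and are not taken from the slides.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(scores, vocabulary):
    """Return the highest-probability answer and its confidence."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocabulary[best], probs[best]

# Hypothetical answer vocabulary and model scores for one (image, question) pair.
vocabulary = ["mozzarella", "cheddar", "two", "yes"]
scores = [2.0, 0.5, -1.0, 0.1]
answer, confidence = predict_answer(scores, vocabulary)
```

Multi-word answers would need a sequence decoder instead of a single softmax, which is the role the Answer LSTM plays in the proposed approach.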



Introduction Literature Survey

1 Neural Image-QA (Malinowski et al., 2015) [4]

2 VSE model: VIS+LSTM (Ren et al., May 2015) [3]



Introduction Literature Survey

3 mQA (Gao et al., Nov 2015) [2]

4 3CNN (Ma et al., Nov 2015) [1]



Approach Methods

1 Modified CNN [Ma et al.]

2 Modified CNN with Question LSTM [Ma et al.]



Approach Methods

3 Modified CNN with Answer LSTM [Gao et al.]

4 Final Approach [Ma et al.] & [Gao et al.]



Approach VQA Components

Components of the Proposed Algorithm

Object Proposal: finds the objects of interest.

Image CNN: extracts the visual representation.

Question LSTM: extracts the question representation.

Sentence CNN: builds the question representation.

Multimodal Convolution: a fusing component that combines the information from the first three components and generates the answer.

Answer LSTM: stores the linguistic context in an answer.




1 Object Proposal [Cheng et al. [7] and Endres et al. [6]]

Figure: The left column shows the 3 highest-ranked proposals. The center column shows the highest-ranked proposal with at least 50% overlap for each object. The right column shows the same for a 75% threshold.
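The overlap thresholds in the figure can be made concrete with a small sketch. It assumes that "overlap" means intersection over union (IoU) between axis-aligned bounding boxes given as (x1, y1, x2, y2), and that proposals arrive already sorted by rank. The function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_proposal(ranked_proposals, target, threshold):
    """Return the highest-ranked proposal whose IoU with target meets the threshold."""
    for box in ranked_proposals:  # assumed sorted from best to worst rank
        if iou(box, target) >= threshold:
            return box
    return None

# Hypothetical ranked proposals and one ground-truth object box.
proposals = [(0, 0, 10, 10), (2, 2, 12, 12), (1, 1, 11, 11)]
target = (1, 1, 11, 11)
at_50 = best_proposal(proposals, target, 0.50)
at_75 = best_proposal(proposals, target, 0.75)
```

Raising the threshold from 50% to 75% can push the selected proposal further down the ranking, which is the effect the center and right columns of the figure contrast.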




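2 Image CNN

$$v_{im} = \sigma\left(w_{im}\,\mathrm{CNN}_{im}(I) + b_{im}\right)$$

$\sigma$: nonlinear activation function.

$w_{im} \in \mathbb{R}^{d \times 4096}$: mapping matrix.

$\mathrm{CNN}_{im}$: takes the image $I$ as input and outputs a fixed-length vector.

$b_{im}$: constant (bias) term.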
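The mapping above can be sketched in a few lines of plain Python. This is an illustration under assumptions: the 4096-dimensional CNN output is replaced by random numbers, the sigmoid stands in for the unspecified activation, the bias is taken to be a vector of length d, and d = 8 is an arbitrary choice. None of these values come from the slides.

```python
import math
import random

def sigmoid(x):
    """Logistic function, used here as a stand-in nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))

def project_image_feature(cnn_feature, w_im, b_im, activation=sigmoid):
    """Compute v_im = activation(w_im * cnn_feature + b_im) with plain lists."""
    return [
        activation(sum(w * x for w, x in zip(row, cnn_feature)) + b)
        for row, b in zip(w_im, b_im)
    ]

random.seed(0)
feature_dim, d = 4096, 8  # d = 8 is an illustrative choice
# Random stand-in for CNN_im(I); a real system would use a pretrained CNN here.
cnn_feature = [random.random() for _ in range(feature_dim)]
w_im = [[random.uniform(-0.01, 0.01) for _ in range(feature_dim)] for _ in range(d)]
b_im = [0.0] * d
v_im = project_image_feature(cnn_feature, w_im, b_im)
```

The result `v_im` has d entries, so the image lives in the same space as the question representation it is later fused with.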

3 LSTM

The LSTM layer stores context information in its memory cells and serves as the bridge among the words in a sequence (e.g., a question). It has three gates:

Input gate and output gate: regulate the read and write access to the LSTM memory cells.

Forget gate: resets the memory cells when their contents are out of date.
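The role of the three gates can be seen in a single time step of a standard LSTM cell. The sketch below uses the common textbook formulation, which may differ in detail from the variant used in the cited papers. The parameter layout, the helper `make_params`, and the example values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vec):
    """Matrix-vector product plus bias, with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (weights, bias)."""
    z = x + h_prev  # concatenate the input and the previous hidden state
    i = [sigmoid(a) for a in affine(*params["input"], z)]   # write access
    f = [sigmoid(a) for a in affine(*params["forget"], z)]  # keep or reset memory
    o = [sigmoid(a) for a in affine(*params["output"], z)]  # read access
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_params(hidden, inputs, forget_bias, input_bias):
    """Hypothetical helper: zero weights, so the gates depend on the biases only."""
    zeros = [[0.0] * (inputs + hidden) for _ in range(hidden)]
    return {
        "input": (zeros, [input_bias] * hidden),
        "forget": (zeros, [forget_bias] * hidden),
        "output": (zeros, [0.0] * hidden),
        "candidate": (zeros, [0.0] * hidden),
    }

x, h0, c0 = [1.0, 0.0, -1.0], [0.0, 0.0], [0.5, -0.5]
_, c_keep = lstm_step(x, h0, c0, make_params(2, 3, 20.0, -20.0))    # forget gate open
_, c_reset = lstm_step(x, h0, c0, make_params(2, 3, -20.0, -20.0))  # forget gate closed
```

With the forget gate saturated open the cell state carries over almost unchanged, and with it saturated closed the memory is wiped, which matches the "resets the memory cells" description.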




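4 Sentence CNN

For a sequential input $v$, the convolution unit for the feature map of type $f$ on the $\ell$-th layer is

$$v^{i}_{(\ell,f)} \overset{\text{def}}{=} \sigma\left(w_{(\ell,f)}\,\vec{v}^{\,i}_{(\ell-1)} + b_{(\ell,f)}\right)$$

where the input to the unit concatenates three consecutive vectors from the previous layer:

$$\vec{v}^{\,i}_{(\ell-1)} \overset{\text{def}}{=} v^{i}_{(\ell-1)} \,\|\, v^{i+1}_{(\ell-1)} \,\|\, v^{i+2}_{(\ell-1)}$$

For the first layer, the input concatenates the embeddings of three consecutive words:

$$\vec{v}^{\,i}_{(0)} \overset{\text{def}}{=} v^{i}_{wd} \,\|\, v^{i+1}_{wd} \,\|\, v^{i+2}_{wd}$$

Max-pooling after each convolution:

$$v^{i}_{(\ell+1,f)} = \max\left(v^{2i}_{(\ell,f)},\; v^{2i+1}_{(\ell,f)}\right)$$

Figure panels: (a) High Level. (b) Detailed Level.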
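One convolution layer followed by max-pooling can be sketched as below. The sketch assumes a window of three consecutive vectors, no zero padding at the sentence boundary, and a sigmoid activation. The toy embeddings, filter values, and function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def concat_window(seq, i, width=3):
    """Concatenate `width` consecutive vectors starting at position i."""
    window = []
    for k in range(width):
        window.extend(seq[i + k])
    return window

def conv_layer(seq, filters, biases, width=3):
    """Apply every feature map f at every window position i."""
    out = []
    for i in range(len(seq) - width + 1):
        window = concat_window(seq, i, width)
        out.append([
            sigmoid(sum(w * v for w, v in zip(w_f, window)) + b_f)
            for w_f, b_f in zip(filters, biases)
        ])
    return out

def max_pool(seq):
    """Pairwise max over positions 2i and 2i+1, separately per feature map."""
    return [
        [max(a, b) for a, b in zip(seq[2 * i], seq[2 * i + 1])]
        for i in range(len(seq) // 2)
    ]

# Toy question: 6 words with 2-dimensional embeddings (hypothetical values).
words = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.4, 0.2], [0.6, 0.1], [0.2, 0.2]]
# Two feature maps, each with 3 * 2 = 6 weights.
filters = [[0.5, -0.5, 0.5, -0.5, 0.5, -0.5], [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]]
biases = [0.0, 0.1]
conv = conv_layer(words, filters, biases)
pooled = max_pool(conv)
```

Here 6 words give 4 window positions, and pooling halves that to 2 positions, each holding one value per feature map.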




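5 Multimodal Convolution

Input: $v_{qt} = \left[v^{0}_{(6)}, \ldots, v^{n}_{(6)}\right]$

Capturing the interaction between the two multimodal inputs:

$$\vec{v}^{\,i}_{(6)} = v^{i}_{(6)} \,\|\, v_{im} \,\|\, v^{i+1}_{(6)} \qquad (1)$$

$$v^{i}_{(mm,f)} = \sigma\left(w_{(mm,f)}\,\vec{v}^{\,i}_{(6)} + b_{(mm,f)}\right) \qquad (2)$$

Figure: Multimodal Convolution [1]. (a) High Level. (b) Detailed Level.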
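Equations (1) and (2) can be sketched as follows: the image vector is placed between each pair of neighbouring question vectors before the convolution is applied. The sigmoid activation, the toy dimensions, and all names and values are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multimodal_conv(question_seq, v_im, filters, biases):
    """For each i, convolve over question_seq[i] || v_im || question_seq[i + 1]."""
    out = []
    for i in range(len(question_seq) - 1):
        window = question_seq[i] + v_im + question_seq[i + 1]  # equation (1)
        out.append([
            sigmoid(sum(w * v for w, v in zip(w_f, window)) + b_f)  # equation (2)
            for w_f, b_f in zip(filters, biases)
        ])
    return out

# Hypothetical sentence-CNN outputs (3 positions, dimension 2) and image vector.
question_seq = [[0.2, 0.1], [0.4, -0.3], [0.0, 0.5]]
v_im = [0.7, -0.2]
# First feature map reads only the image slots; second reads only the question slots.
filters = [[0.0, 0.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]]
biases = [0.0, 0.0]
fused = multimodal_conv(question_seq, v_im, filters, biases)
```

In practice each filter mixes both modalities; the two filters here are split by modality only to make it visible which part of the window each weight touches.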



Approach VQA Performance

VQA performances on DAQUAR-Reduced and COCO-QA [1]

Figure: VQA performances. (a) DAQUAR-Reduced. (b) COCO-QA.



Approach VQA Performance

VQA Demo [1]



Thank You

Thank you



References

[1] L. Ma, Z. Lu, and H. Li, "Learning to Answer Questions From Image Using Convolutional Neural Network", CoRR abs/1506.00333, Nov 2015.

[2] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, "Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering", arXiv:1505.05612v3, Nov 2015.

[3] M. Ren, R. Kiros, and R. S. Zemel, "Exploring Models and Data for Image Question Answering", arXiv:1505.02074, 2015.

[4] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images", arXiv:1505.01121, Nov 2015.

[5] A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions", in CVPR, 2015.

[6] I. Endres and D. Hoiem, "Category-Independent Object Proposals With Diverse Ranking", http://vision.cs.uiuc.edu/proposals/, PAMI, February 2014.

[7] M.-M. Cheng et al., "BING: Binarized Normed Gradients for Objectness Estimation at 300fps", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

