content based visual question answer -...
TRANSCRIPT
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Content Based Visual Question Answer
Badri Patro, Vinay Verma & Aatanu Samata
Group 30 :Project Proposal PresentationCS676A:Computer Vision and Image Processing
IIT Kanpur
March 12, 2016
VQA Computer Vision & Image Processing March 12, 2016 1 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Outline
1 IntroductionProblem StatementLiterature Survey
2 ApproachMethodsVQA Components
Object ProposalImage CNNSentence CNNMultimodal Convolution
VQA Performance
3 References
VQA Computer Vision & Image Processing March 12, 2016 2 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Introduction
Type of Question
Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).
Object detection(e.g., How many bikes are there?).
Activity recognition (e.g., Is this man crying?).
Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).
Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..
VQA Computer Vision & Image Processing March 12, 2016 3 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Introduction
Type of Question
Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).
Object detection(e.g., How many bikes are there?).
Activity recognition (e.g., Is this man crying?).
Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).
Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..
VQA Computer Vision & Image Processing March 12, 2016 3 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Introduction
Type of Question
Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).
Object detection(e.g., How many bikes are there?).
Activity recognition (e.g., Is this man crying?).
Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).
Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..
VQA Computer Vision & Image Processing March 12, 2016 3 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Introduction
Type of Question
Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).
Object detection(e.g., How many bikes are there?).
Activity recognition (e.g., Is this man crying?).
Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).
Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..
VQA Computer Vision & Image Processing March 12, 2016 3 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Introduction
Type of Question
Fine-grained recognition (e.g.,What kind of cheese ison the pizza?).
Object detection(e.g., How many bikes are there?).
Activity recognition (e.g., Is this man crying?).
Knowledge base reasoning(e.g., Why did Katappakilled Bahubali?).
Commonsense reasoning (e.g., Does this person have20/20 vision?, Is this person expecting company?)..
VQA Computer Vision & Image Processing March 12, 2016 3 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Visual Question Answering:
Problem Statement: Given an image, and content basedquestion, find the correct answer and confidence of theanswer.
Training on a set of triplets (image, question, answer).
Free-form and open-ended(*) questions.
Answers can be single word or multiple word,depending on dataset.
VQA Computer Vision & Image Processing March 12, 2016 4 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Problem Statement
Visual Question Answering:
Problem Statement: Given an image, and content basedquestion, find the correct answer and confidence of theanswer.
Training on a set of triplets (image, question, answer).
Free-form and open-ended(*) questions.
Answers can be single word or multiple word,depending on dataset.
VQA Computer Vision & Image Processing March 12, 2016 4 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Literature Survey
1 Neural Image-QA(Malinowski et al.,2015)[4]
2 VSE model- VIS LSTM(Ren et al.,May,2015)[3]
VQA Computer Vision & Image Processing March 12, 2016 5 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Introduction Literature Survey
3 mQA(Gao et al.,Nov, 2015)[2]
4 3CNN(Ma et al.,Nov,2015)[1]
VQA Computer Vision & Image Processing March 12, 2016 6 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach Methods
1 Modified CNN,[Ma et al.]
2 Modified CNN with QuestionLSTM,[Ma et al.]
VQA Computer Vision & Image Processing March 12, 2016 7 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach Methods
3 Modified CNN with AnswerLSTM,[Geo et al.]
4 Final Approach,[Ma et al.] & [Geo et al.]
VQA Computer Vision & Image Processing March 12, 2016 8 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Components
Component of Proposed Algorithm
Object Proposal:Find Interest Object
Image CNN:to extract the visual representation
Question LSTM : extract the question representation
Sentence CNN: question representation
Multimodal Convolution: a fusing component tocombine the information from the first threecomponents and generate the answer.
Answer LSTM:for storing the linguistic context in ananswer
VQA Computer Vision & Image Processing March 12, 2016 9 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Components
1 Object Proposal[Cheng et al. [?]and Endres etal. [6]]
Figure: The left column shows the 3 highest ranked proposals,The center column shows the highest ranked proposal with 50%overlap for each object. The right column shows the same for a75% threshold
VQA Computer Vision & Image Processing March 12, 2016 10 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Components
2 Image CNN
vim = σ(wim(CNNim(I )) + bim)
σ: Nonlinear activation function.wim|dX4096 : Mapping matrixCNNimtakes image as input and outputs a fixed lengthvector.bim constant
3 LSTMLSTM layer stores the context information in its memorycells and serves as the bridge among the words in asequence (e.g. a question).It has three gate :
Input gate and Output gate : Regulate the read and writeaccess to the LSTM memory cells.Forget gate: Resets the memory cells when their contentsare out of date.
VQA Computer Vision & Image Processing March 12, 2016 11 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Components
The columns environment
2 Sentence CNN
For sequential input σ, convolutionunit for feature map of type f on thel th layer is
v i(l,f )=
def σ(w(l,f )~vi(l1) + b(l,f ))
~v i(l−1) =
def v i(l−1)||v
i+1(l−1)||v
i+1(l−1)
~v i(0) =
def v iwd ||v i+1
wd ||v i+1wd
Max-pooling after each convolution
v i(l+1,f ) = max(v2i
(l,f ), v2i+1(l,f ) )
(a) High Level.
(b)DetailedLevel.
VQA Computer Vision & Image Processing March 12, 2016 12 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Components
2 Multimodal Convolution
input: vqt = [v0(6)...v
n(6)]
Capturing the interaction between two multimodal inputs
~v i6 = v i
6||vim||v i+16 (1)
v i(mm,f ) = σ(w(mm,f )~v
i6 + b(mm,f )) (2)
(a) High Level.(b) DetailedLevel.
Figure: Multimodal Convolution[1].
VQA Computer Vision & Image Processing March 12, 2016 13 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Performance
2 VQA performances on DAQUAR-Reduced[1]
(a) DAQUAR-Reduced. (b) COCO-QA.
Figure: VQA performances on DAQUAR-Reduced[1].
VQA Computer Vision & Image Processing March 12, 2016 14 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Approach VQA Performance
VQA Demo[1]
VQA Computer Vision & Image Processing March 12, 2016 15 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
Thank You
Thank you
VQA Computer Vision & Image Processing March 12, 2016 16 / 16
ContentBased Visual
QuestionAnswer
Badri Patro,Vinay Verma& Aatanu
Samata
Outline
Introduction
ProblemStatement
LiteratureSurvey
Approach
Methods
VQAComponents
ObjectProposal
Image CNN
Sentence CNN
MultimodalConvolution
VQAPerformance
Thank You
References
References
[1] L. Ma, Z. Lu, and H. Li., ‘‘Learning to Answer Questions From Imageusing Convolutional Neural Network”,CoRR abs/1506.00333, Nov,2015.
[2] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang and W. Xu.,‘‘Are youtalking to a machine? dataset and methods for multilingual imagequestion answering.”,arXiv 1505.05612v3, Nov, 2015.
[3] M. Ren, R. Kiros, and R. S. Zemel, ‘‘Exploring models and data forimage question answering”,arXiv 1505.02074, , 2015.
[4] M. Malinowski, M. Rohrbach, and M. Fritz.,‘‘Ask your neurons: Aneural-based approach to answering questions about images.”,arXiv1505.01121, Nov, 2015.
[5] A. Karpathy and L. Fei-Fei.,‘‘Deep Visual-Semantic Alignments forGenerating Image Descriptions.”,In CVPR, 2015.
[6] I. Endres, and D. Hoiem.,‘‘Category-Independent Object ProposalsWith Diverse Ranking.”, http://vision.cs.uiuc.edu/proposals/,PAMIFebruary, 2014.
[7] Cheng, Ming-Ming, et al.,‘‘BING: Binarized normed gradients forobjectness estimation at 300fps.”, Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition. 2014.
VQA Computer Vision & Image Processing March 12, 2016 16 / 16