vqa: visual question answering · visual question answering given an input image and a...

25

VQA: Visual Question Answering Anmol Popli and Sparsh Gupta

Upload: others

Post on 27-Jun-2020

5 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

VQA: Visual Question AnsweringAnmol Popli and Sparsh Gupta

Page 2: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 3: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 4: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Visual Question AnsweringGiven an input image and a natural-language question about the image, the task is to provide a natural-language answer as output.

Some key areas of VQA application are:-- Helping visually impaired users understand their surroundings

-- Helping intelligence analysts working on visual data

-- Efficient image retrieval for specific search queries

Page 5: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Visual Question AnsweringKey Challenges-- Multi-discipline problem - language understanding and vision understanding

-- Vision tasks: object detection, object recognition, counting, scene classification

Source: Antol et al. VQA: Visual Question Answering. IEEE ICCV 2015

Page 6: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 7: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Dataset● VQA Dataset (v 2.0) for VQA 2019 challenge

● Our work focuses on Balanced Real Images portion of the entire dataset

● 444k questions based on 83k images

● 10 annotations per question

www.visualqa.org https://visualqa.org/download.html

http://www.visualqa.org

https://visualqa.org/download.html

Page 8: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Preprocessing and Training Specifications● Resized images to 224 x 224● Chose only top 3000 most frequent answers● Converted text to lowercase, removed punctuations, and treated named

entities as UNK● Max length of question text capped to 15 words● Used GloVe embeddings to initialize word embeddings● Used Cross Entropy Loss for the 3000-class classification problem● Used Adam optimizer with scheduled learning rate decay● L2 regularization of weights

Page 9: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 10: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Joint Embedding Based Model

Source: Antol et al. VQA: Visual Question Answering. IEEE ICCV 2015

Page 11: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Joint Embedding Based Model - Examples

is the tv on or off? on what side of the plate is the fork on? left

what flavor is the cake? vanilla what is the man in the right wearing? wetsuit

Page 12: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 13: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Vanilla Attention Based Model

Source: Kazemi et al. Show, ask, attend and answer: Strong Baseline for Visual Question Answering. arxiv.org/abs/1704.03162

Page 14: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Vanilla Attention Based Model - Examples

Page 15: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Vanilla Attention Based Model - Examples

Page 16: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Vanilla Attention Based Model - Examples

Page 17: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Vanilla Attention Based Model - Attention Visualization

How many planes are in the air? 1 What kind of collar is the man’s shirt? white

Page 18: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 19: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Bottom-Up and Top-Down Attention Based Model

Source: Anderson et al. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering. arxiv.org/abs/1707.07998

Page 20: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Bottom-Up and Top-Down Attention Based Model● Faster-RCNN with ResNet-101 backend

● Regions of Interest (ROIs) obtained from Faster-RCNN

● Allows the model to compute attention for selected regions, and not the entire image

● Feature maps of each ROI are computed and are passed to the model

● Attention weight for each ROI feature map is computed

Page 21: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 22: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Multihop Attention Network● Image features from

Faster-RCNN for candidate ROIs

● Question features from LSTM

● Multiple layers of attention for attending to different regions of the image based on different words

● Aggregation - concatenation or addition

Page 23: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

AgendaIntroduction

Dataset

Joint Embedding Based Model

Vanilla Attention Based Model

Bottom-Up and Top-Down Attention Based Model

Ideas from Multihop Attention Network

Results and Future Work

Page 24: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Results and Future Work

● Results obtained on validation data are at par with the results reported in respective papers

● Experimental model yet to be trained and tested

● Final test data predictions of experimental model to be submitted to challenge page

● Code will be made public shortly

Model Accuracy on Validation Data

Joint Embedding Based Model - VGG 16 49.28 %

Joint Embedding Based Model - ResNet 152 51.76 %

Vanilla Attention Based Model - VGG 16 55.35%

Vanilla Attention Based Model - ResNet 152 57.13%

Page 25: VQA: Visual Question Answering · Visual Question Answering Given an input image and a natural-language question about the image, the task is to provide a natural-language answer

Q & A

RUBi: Reducing Unimodal Biases in Visual Question Answering › pdf › 1906.10169.pdf · RUBi: Reducing Unimodal Biases in Visual Question Answering Remi Cadene 1Corentin Dancette

An Analysis of Visual Question Answering AlgorithmsAn Analysis of Visual Question Answering Algorithms Kushal Kaﬂe Christopher Kanan Rochester Institute of Technology Rochester,

Visual Question Answering on 360deg Images...Visual Question Answering on 360 ... of them aim for open-ended answering [2, 19, 30], provid-ing for a pair of image and question with

Structured Attention for Visual Question Answering Chen ...faculty.sist.shanghaitech.edu.cn/faculty/tukw/iccv17ZZHTM-poster.pdf · Structured Attention for Visual Question Answering

An Interpretable Multimodal Visual Question Answering

Heterogeneous Graph Learning for Visual Commonsense Reasoning · for image captioning and visual question answering. In multi-hop reasoning question answering task, a bi-directional

Open-Ended Visual Question-Answering · Open-Ended Visual Question-Answering A Degree Thesis Submitted to the Faculty of the Escola T ecnica Superior d’Enginyeria de Telecomunicaci

Answering Visual Question - University Of Illinoisslazebni.cs.illinois.edu/spring17/lec23_vqa.pdfAntol, Stanislaw, et al. "Vqa: Visual question answering." ... Saurabh Singh, and Derek

Deep Learning for Visual Question Answering · Deep Learning for Visual Question Answering Avi Singh [email protected] Abstract This project deals with the problem of Visual Question

1 Million Full-Sentences Visual Question Answering (FSVQA) · 1 Million Full-Sentences Visual Question Answering (FSVQA) ... ating short and repetitive answers, ... Question Ans

VQA: Visual Question Answering - CVF Open Accessopenaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA...VQA: Visual Question Answering Stanislaw Antol∗1 Aishwarya Agrawal∗1

MUTAN: Multimodal Tucker Fusion for Visual Question Answering

VQA: Visual Question Answering - arXiv · 1 VQA: Visual Question Answering Aishwarya Agrawal , Jiasen Lu , Stanislaw Antol , Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi

Towards Good Practice for Visual Question Answeringwangzheallen.github.io/papers/vqa_showevent.pdf · Hierarchical question-image co-attention for visual question answering. In NIPS,

Medical Visual Question Answering via Conditional Reasoning

IQA: Visual Question Answering in Interactive Environments · 2019-03-12 · IQA: Visual Question Answering in Interactive Environments Daniel Gordon1 Aniruddha Kembhavi 2Mohammad

Explicit Bias Discovery in Visual Question Answering Models

Document Collection Visual Question Answering

Designing deep architectures for Visual Question AnsweringVisual Question Answering: What does Claudia do?+ Visual Question Answering Sitting at the bottom Standing at the back …

Knowledge Acquisition for Visual Question Answering via Iterative … › content_cvpr_2017 › poster › ... · 2018-01-23 · Knowledge Acquisition for Visual Question Answering

OK-VQA: A Visual Question Answering Benchmark Requiring ...openaccess.thecvf.com/content_CVPR_2019/papers/Marino_OK-VQA… · 2. Related Work Visual Question Answering (VQA). Visual

High-Order Attention Models for Visual Question Answering

1 VQA: Visual Question Answering - Virginia Tech › ~santol › projects › vqa › iccv...1 VQA: Visual Question Answering Stanislaw Antol , Aishwarya Agrawal , Jiasen Lu, Margaret

What is in that picture ? Visual Question Answering Systemcs229.stanford.edu/proj2017/final-reports/5243359.pdf · 1 Introduction Visual Question Answering is a complex task which

IQA: Visual Question Answering in Interactive Environments · 2018-09-24 · IQA: Visual Question Answering in Interactive Environments Daniel Gordon1 Aniruddha Kembhavi 2Mohammad

Knowledge Acquisition for Visual Question Answering via ...vision.stanford.edu › pdf › zhu2017cvpr.pdf · a visual question answering (VQA) task? Most of today’s VQA methods

MemexQA: Visual Memex Question Answering · 2017. 8. 7. · MemexQA: Visual Memex Question Answering Lu Jiang 1, Junwei Liang , Liangliang Cao2, Yannis Kalantidis3y Sachin Farfade3,

IQA: Visual Question Answering in Interactive … Visual Question Answering in Interactive Environments Daniel Gordon2 Aniruddha Kembhavi 1Mohammad Rastegari Joseph Redmon 2Dieter

Visual language processing: image/video captioning and question answering

Knowledge Acquisition for Visual Question Answering via Iterative …openaccess.thecvf.com/content_cvpr_2017/poster/397_POSTER.pdf · image co-aIen6on for visual ques6on answering

Check It Again:Progressive Visual Question Answering via

OCR-VQA: Visual Question Answering by Reading Text in Images · 2020-06-04 · dataset for studying visual question answering by reading text, i.e., OCR-VQA task.1 C. Visual question

An Empirical Evaluation of Visual Question Answering for Novel … · 2017. 4. 11. · An Empirical Evaluation of Visual Question Answering for Novel Objects Santhosh K. Ramakrishnan

Question-Guided Hybrid Convolution for Visual …openaccess.thecvf.com/content_ECCV_2018/papers/gao_peng...Question-Guided Hybrid Convolution for Visual Question Answering 3 Motivated

Answer-Type Prediction for Visual Question Answering