Deep Learning in Computer Vision
Yikang Li
MMLab, The Chinese University of Hong Kong
Sep 22nd, 2017 @ Microsoft Research Asia, China
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Introduction - DL in the press
[1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/
[2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html
Introduction - DL in the press
CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors)
http://cvpr2017.thecvf.com/
Introduction - Investment in AI
http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754
Renowned Researchers/Groups
- Trevor Darrell, BAIR, UC Berkeley
- Recognition, detection
- Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN)
- Fei-Fei Li, Stanford University
- ImageNet, emerging topics
- Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI)
- Antonio Torralba, CSAIL, MIT
- Scene understanding, multimodality-based Computer Vision
- Facebook Artificial Intelligence Research (FAIR)
- DeepMind, Google Brain, Google Research
- Microsoft Research
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Roadmap of Deep Learning - Depth
Basic Block - Convolution
Convolution operation: f(x) = Wx + b, where the output f(x) is called a feature map. W is the shared weight (the kernel/filter of the layer); the same weights are reused at every spatial location.
Convolution is applied in a sliding-window fashion, which saves parameters and yields translation invariance, a property that is very important for vision tasks.
A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.
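A minimal NumPy sketch of this sliding-window operation (single channel, stride 1, no padding; all names and sizes are illustrative):

import numpy as np

def conv2d(x, W, b):
    """Slide one kernel W over input x; the same weights are reused
    at every location, which is what gives translation invariance."""
    kh, kw = W.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * W) + b
    return out  # the feature map: f(x) = Wx + b at each window

x = np.random.randn(8, 8)          # toy single-channel input
W = np.random.randn(3, 3)          # shared 3x3 kernel
print(conv2d(x, W, 0.0).shape)     # (6, 6)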
Roadmap of Deep Learning - Network Structure (cont’d)
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
What is object detection?
http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf
Why object detection? It is a fundamental task in vision.
+ Detection in general classes
+ Face detection, crowd analysis
+ Car/signal detection
R-CNN -> Fast R-CNN -> Faster R-CNN
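The jump from R-CNN to Fast R-CNN hinges on RoI pooling: run the CNN once over the whole image, then crop each region proposal from the shared feature map. A minimal sketch using torchvision's roi_pool (the feature map, boxes, and stride are toy values, not the talk's actual setup):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)      # shared conv feature map (1 image)
# proposals in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 10., 10., 200., 200.],
                     [0, 50., 80., 300., 400.]])
# pool every RoI to a fixed 7x7 grid; spatial_scale maps image coords
# to feature-map coords (stride 16 here, as in a VGG-style backbone)
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1/16)
print(pooled.shape)                          # torch.Size([2, 256, 7, 7])

Faster R-CNN then replaces the external proposal step with a Region Proposal Network that shares these same conv features.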
Detection Results
Back to the General Picture: Deep Learning for Computer Vision
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Image Captioning
Describe an image with a natural sentence
❏ Two gentleman talking in front of propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.
Datasets:
- PASCAL Sentence Dataset
  - 1,000 images & 5 sentences/image
  - Images originally collected for image classification, object detection and segmentation
  - No filtering: complex scenes, varied scales and viewpoints of different objects
- Flickr 8K
  - 8,108 images & 5 sentences/image
  - Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
- Flickr 30K
  - Extension of Flickr 8K
- MS COCO
  - Largest captioning dataset
  - Includes captions & object annotations
  - 328,000 images & 5 sentences/image
- Visual Genome
  - Densely-annotated dataset
  - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
  - 108,077 images with full annotations
  - Not very clean; needs a little pre-processing
Metrics:
- BLEU, METEOR, ROUGE, CIDEr, human evaluation (a BLEU sketch follows below)
Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
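As a concrete example of these metrics, BLEU scores a candidate caption by its n-gram overlap with the reference sentences. A small sketch using NLTK (the sentences are toy examples in the spirit of the captions above):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "two men are conversing next to a small airplane".split(),
    "two men talking in front of a plane".split(),
]
candidate = "two men talk near a small plane".split()

# BLEU-4 with smoothing so short sentences don't zero out
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))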
A simple Baseline: NeuralTalk
A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
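NeuralTalk's recipe is a CNN encoder feeding an RNN language model that predicts the caption word by word. A minimal PyTorch sketch of that pattern (dimensions and class names are illustrative, not Karpathy's actual code):

import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # CNN feature -> embed space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # condition the LSTM on the image by prepending the projected feature
        img_tok = self.img_proj(img_feat).unsqueeze(1)          # (B, 1, E)
        words = self.embed(captions)                            # (B, T, E)
        h, _ = self.rnn(torch.cat([img_tok, words], dim=1))
        return self.out(h)                                      # next-word logits

model = CaptionRNN(vocab_size=10000)
logits = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)   # torch.Size([2, 13, 10000])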
Attention Mechanism: Show, Attend and Tell
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
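The soft attention step in Show, Attend and Tell scores each of the k spatial feature vectors against the LSTM hidden state and feeds their weighted sum back as context. A hedged sketch of just that step (dimensions illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat = nn.Linear(feat_dim, attn_dim)
        self.hid = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, V, h):
        # V: (B, k, feat_dim) spatial features; h: (B, hidden_dim) LSTM state
        e = self.score(torch.tanh(self.feat(V) + self.hid(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)          # (B, k) attention map
        context = (alpha.unsqueeze(-1) * V).sum(dim=1)   # weighted sum z_t
        return context, alpha

attn = SoftAttention(feat_dim=512, hidden_dim=512)
ctx, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 512))
print(ctx.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 196])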
Modified Attention Mechanism: Know when to look
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887
Adaptive Attention module
Determines how to mix visual and linguistic information via a visual sentinel (a softmax over the k feature-map vectors plus 1 linguistic sentinel vector).
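A hedged sketch of that mixing step: append the sentinel s_t to the k spatial vectors, take one softmax over all k+1 scores, and the sentinel's weight acts as a gate toward purely linguistic generation (shapes illustrative):

import torch
import torch.nn.functional as F

def adaptive_context(V, s, scores):
    # V: (B, k, D) visual vectors; s: (B, D) visual sentinel;
    # scores: (B, k+1) unnormalized attention over [V; s]
    alpha = F.softmax(scores, dim=1)                 # one softmax over k+1 slots
    beta = alpha[:, -1:]                             # sentinel gate in [0, 1]
    visual = (alpha[:, :-1].unsqueeze(-1) * V).sum(1)
    return visual + beta * s                         # adaptive context c_t

c = adaptive_context(torch.randn(2, 49, 512), torch.randn(2, 512),
                     torch.randn(2, 50))
print(c.shape)   # torch.Size([2, 512])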
Concept-driven Image Captioning
Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002
Dense Captioning
Localize and describe salient regions with natural sentences
DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Visual Q&A
Answer an image-based question
Question: What color is the man's tie?
Answer: Brown
Datasets:
- DAQUAR
  - First dataset and benchmark released for the VQA task
  - Images are from the NYU Depth V2 dataset with semantic segmentations
  - 1,449 images (795 training, 654 test), 12,468 questions (auto-generated & human-annotated)
- COCO-QA
  - Automatically generated from image captions
  - 123,287 images; 78,736 train questions, 38,948 test questions
  - 4 types of questions: object, number, color, location
  - Answers are all one word
- VQA
  - Most widely-used VQA dataset
  - Two parts: one contains images from COCO, the other contains abstract scenes
  - 204,721 COCO images and 50,000 abstract images with ~5.4 questions/image
- CLEVR
  - A diagnostic dataset for the reasoning ability of VQA models
  - Rendered images and automatically-generated questions with functional programs and scene graphs
  - 100,000 images (70,000 train, 15,000 val, 15,000 test) with ~10 questions/image
- Visual Genome
  - Densely-annotated dataset
  - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
  - 108,077 images with 1.7M grounded Q&A pairs
  - Not very clean; needs a little pre-processing
Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865
Simple Baseline Method
Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
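The baseline is strikingly simple: concatenate a bag-of-words encoding of the question with the CNN image feature and train a single softmax classifier over answers. A hedged sketch of that pattern (sizes and names illustrative):

import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=4096, embed_dim=300):
        super().__init__()
        self.word_embed = nn.EmbeddingBag(vocab_size, embed_dim)  # bag of words
        self.classifier = nn.Linear(embed_dim + img_dim, num_answers)

    def forward(self, question_ids, img_feat):
        q = self.word_embed(question_ids)          # pooled word embeddings
        return self.classifier(torch.cat([q, img_feat], dim=1))

model = BowImgBaseline(vocab_size=10000, num_answers=1000)
logits = model(torch.randint(0, 10000, (2, 8)), torch.randn(2, 4096))
print(logits.shape)   # torch.Size([2, 1000])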
A Strong Baseline: Attention (1)
Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394
A Strong Baseline: Attention (2)
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162
Multiple glimpses
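A hedged sketch of the multiple-glimpse idea used by such attention baselines: compute G attention maps over the same spatial features and concatenate the G weighted sums (all shapes and names illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

def glimpses(V, q, proj, G=2):
    # V: (B, k, D) image features; q: (B, D) question encoding;
    # proj: nn.Linear(2*D, G) scoring each location for each glimpse
    B, k, D = V.shape
    fused = torch.cat([V, q.unsqueeze(1).expand(B, k, D)], dim=-1)
    alpha = F.softmax(proj(fused), dim=1)              # (B, k, G)
    out = torch.einsum('bkg,bkd->bgd', alpha, V)       # one context per glimpse
    return out.reshape(B, G * D)                       # concatenated glimpses

proj = nn.Linear(1024, 2)
z = glimpses(torch.randn(3, 196, 512), torch.randn(3, 512), proj)
print(z.shape)   # torch.Size([3, 1024])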
Co-Attention Mechanism for Image & Question
Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
Parallel Co-Attention vs. Alternating Co-Attention
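One simplified reading of the parallel variant: build an affinity matrix between every question word and every image location, then attend in both directions (a sketch with illustrative shapes, not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def parallel_coattention(Q, V, Wb):
    # Q: (B, T, D) word features; V: (B, N, D) image features;
    # Wb: nn.Linear(D, D) bilinear affinity weights
    C = torch.tanh(torch.bmm(Q, Wb(V).transpose(1, 2)))   # (B, T, N) affinity
    a_v = F.softmax(C.max(dim=1).values, dim=1)           # attention over regions
    a_q = F.softmax(C.max(dim=2).values, dim=1)           # attention over words
    v_hat = torch.bmm(a_v.unsqueeze(1), V).squeeze(1)     # attended image vector
    q_hat = torch.bmm(a_q.unsqueeze(1), Q).squeeze(1)     # attended question vector
    return q_hat, v_hat

Wb = nn.Linear(512, 512)
q_hat, v_hat = parallel_coattention(torch.randn(2, 12, 512),
                                    torch.randn(2, 196, 512), Wb)
print(q_hat.shape, v_hat.shape)   # torch.Size([2, 512]) torch.Size([2, 512])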
Hierarchical Question Encoding
Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
Hierarchical question encoding scheme; encoding for answer prediction
Multimodal Fusion: Bilinear interaction modeling
MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
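A simplified Tucker-style fusion in the spirit of MUTAN, not the paper's exact parametrization: project both modalities to small ranks, interact them through a learned core tensor, then project to answer logits (ranks and sizes illustrative):

import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, q_dim, v_dim, out_dim, rank_q=160, rank_v=160, rank_o=160):
        super().__init__()
        self.Wq = nn.Linear(q_dim, rank_q)
        self.Wv = nn.Linear(v_dim, rank_v)
        self.core = nn.Parameter(torch.randn(rank_q, rank_v, rank_o) * 0.01)
        self.Wo = nn.Linear(rank_o, out_dim)

    def forward(self, q, v):
        tq, tv = self.Wq(q), self.Wv(v)                    # factor projections
        # contract the core tensor with both modality factors
        z = torch.einsum('bi,ijk,bj->bk', tq, self.core, tv)
        return self.Wo(z)                                  # fused prediction

fuse = TuckerFusion(q_dim=2400, v_dim=2048, out_dim=2000)
print(fuse(torch.randn(2, 2400), torch.randn(2, 2048)).shape)  # torch.Size([2, 2000])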
Duality of Question Answering and Question Generation
Visual Question Generation as Dual Task of Visual Question Answering
Duality of Question Answering and Question Generation: Dual MUTAN
Visual Question Generation as Dual Task of Visual Question Answering
Learning to Reason: Compositional Network
Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526
End-to-End Training with policy gradient
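A hedged sketch of one policy-gradient step for such a layout policy: sample a discrete module layout, treat the downstream answer quality as reward, and apply REINFORCE through the log-probability of the sampled choice (all values are toy stand-ins):

import torch

logits = torch.randn(4, 10, requires_grad=True)    # policy scores over 10 layouts
dist = torch.distributions.Categorical(logits=logits)
layout = dist.sample()                             # non-differentiable discrete choice
# In the real model the reward would come from executing the sampled
# module layout and scoring the predicted answer; here it is random.
reward = torch.rand(4)
baseline = reward.mean()                           # simple variance-reducing baseline
loss = -((reward - baseline) * dist.log_prob(layout)).mean()
loss.backward()                                    # gradient flows through log-prob
print(logits.grad.shape)                           # torch.Size([4, 10])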
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Visual Relations
Describe the image with object nodes and their interactions
Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700
Baseline: Visual Relationship Detection with Language Priors
Uses word2vec as extra information for predicate recognition
Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
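A hedged sketch of combining the two cues: a visual predicate score from the union-box CNN feature, and a language score from the concatenated word vectors of subject and object. The combination rule and all names/sizes are illustrative, not the paper's exact formulation:

import torch
import torch.nn as nn

class PredicateScorer(nn.Module):
    def __init__(self, num_predicates, vis_dim=4096, w2v_dim=300):
        super().__init__()
        self.visual = nn.Linear(vis_dim, num_predicates)
        # language module: word2vec(subject) ++ word2vec(object) -> score
        self.language = nn.Linear(2 * w2v_dim, num_predicates)

    def forward(self, union_feat, subj_vec, obj_vec):
        v = self.visual(union_feat)                            # visual evidence
        l = self.language(torch.cat([subj_vec, obj_vec], 1))   # language prior
        return v * torch.sigmoid(l)   # hypothetical gating of visual by language

scorer = PredicateScorer(num_predicates=70)
s = scorer(torch.randn(2, 4096), torch.randn(2, 300), torch.randn(2, 300))
print(s.shape)   # torch.Size([2, 70])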
Jointly detect objects and relations
Leverage the dependencies between objects and their relationships as extra constraints
Triplet proposal:
- Region proposal: an RPN generates object proposals
- Triplet proposal: group the proposals into <subject-phrase-object> triplet proposals
- Triplet NMS: redundant proposal removal (a sketch follows the citation below)
Phrase detection:
- Branch-based detection model
- RoI pooling helps each branch focus on a different component
- A message-passing structure (VPRS) lets the branches share information and treat the three components as a whole
ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/
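A hedged sketch of triplet NMS: suppress a <subject, object> proposal pair when both of its boxes heavily overlap a higher-scored kept pair. The suppression criterion, threshold, and data are illustrative:

import torch
from torchvision.ops import box_iou

def triplet_nms(subj_boxes, obj_boxes, scores, iou_thresh=0.5):
    order = scores.argsort(descending=True)
    iou_s = box_iou(subj_boxes, subj_boxes)   # pairwise subject IoU
    iou_o = box_iou(obj_boxes, obj_boxes)     # pairwise object IoU
    keep = []
    for i in order.tolist():
        # suppress only if BOTH boxes overlap an already-kept triplet
        if all(min(iou_s[i, j], iou_o[i, j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

subj = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]])
obj = torch.tensor([[20., 20., 30., 30.], [21., 21., 31., 31.]])
print(triplet_nms(subj, obj, torch.tensor([0.9, 0.8])))   # [0]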
Relations as an intermediate level between Objects and Region Captions
Leverage the dependencies between objects and their relationships as extra constraints
Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/
Emerging Topics: Human-Object Interaction
Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333