deep learning in computer visioncvboy.com/slides/msra_language_and_vision.pdf · image captioning...

Deep Learning in Computer Vision Yikang Li MMLab, The Chinese University of Hong Kong Sep 22nd, 2017 @Microsoft Research Asia, China

Deep Learning in Computer Vision

Yikang Li

MMLab, The Chinese University of Hong Kong

Sep 22nd, 2017 @Microsoft Research Asia, China

Outline

1. Introduction

2. Roadmap of Deep Learning

3. DL in CV: Object detection

4. DL in CV: Image Captioning

5. DL in CV: Visual Question Answering

6. DL in CV: Visual Relations

Introduction - DL in the press

[1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/

[2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html

Introduction - DL in the press

CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors): http://cvpr2017.thecvf.com/

Renowned Researchers/Groups

- Trevor Darrell, BAIR, UC Berkeley
    - Recognition, detection
    - Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast RCNN)
- Fei-Fei Li, Stanford University
    - ImageNet, emerging topics
    - Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI)
- Antonio Torralba, CSAIL, MIT
    - Scene understanding, multimodality-based computer vision
- Facebook Artificial Intelligence Research (FAIR)
- DeepMind, Google Brain, Google Research
- Microsoft Research

Roadmap of Deep Learning - Depth

Basic Block - Convolution

Convolution operation: f(x) = Wx + b, where the output f is called a feature map and W is the kernel (also called a filter), i.e. the layer's learnable weights. The same W and b are shared across all spatial locations.

Convolution is applied in a sliding-window fashion. This saves parameters and makes the response to a pattern independent of where it appears in the image (translation invariance), which is very important for vision tasks.

A deep convolutional network is essentially a stack of convolutional layers. Rule of thumb: deeper usually means better.
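
The sliding-window computation described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk: the function name `conv2d`, the toy image and the 1x2 `edge_kernel` are all made up for the example, and it handles a single channel with stride 1 and no padding.

```python
def conv2d(image, kernel, bias=0.0):
    """Valid (no padding), stride-1 sliding-window convolution of one channel.

    As in most deep learning frameworks, the kernel is not flipped,
    so strictly speaking this computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same weights W (kernel) and bias b are reused at every (i, j).
            acc = bias
            for di in range(kh):
                for dj in range(kw):
                    acc += kernel[di][dj] * image[i + di][j + dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 image containing a bright 2x2 square (hypothetical data).
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
edge_kernel = [[1, -1]]  # hypothetical horizontal-difference filter
print(conv2d(image, edge_kernel))
```

The kernel has only two weights regardless of the image size, which is the parameter saving the slide refers to. Moving the square inside the image moves the non-zero responses with it but does not change their values.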

Roadmap of Deep Learning - Network Structure (cont’d)

What is object detection?

http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf

Why object detection?

It is a fundamental task in vision:

+ Detection in general classes
+ Face detection, crowd analysis
+ Car/signal detection

RCNN -> Fast RCNN -> Faster RCNN

Detection Results

Back into the General Picture: Deep Learning for Computer Vision

Image Captioning: describe an image with a natural sentence

❏ Two gentleman talking in front of propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.

Dataset:

- PASCAL SENTENCE DATASET
    - 1000 images & 5 sents / im
    - Designed for image classification, object detection and segmentation.
    - No filtering, complex scenes, scaling, view points of different objects.
- FLICKR 8K
    - 8108 images & 5 sents / im
    - Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
- FLICKR 30K
    - Extension of Flickr 8K
- MS COCO
    - Largest caption dataset
    - Includes captions & object annotations
    - 328,000 images & 5 sents / im
- Visual Genome
    - Densely-annotated dataset
    - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
    - 108,077 images with full annotations
    - Not very clean, needs a little pre-processing

Metric:

- BLEU, METEOR, ROUGE, CIDEr, human-based measurement
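
To make the first metric concrete, here is a minimal sketch of sentence-level BLEU: clipped n-gram precision against several reference captions, combined with a brevity penalty. It is a simplified illustration rather than a reference implementation. It uses no smoothing and defaults to bigrams (`max_n=2`), both of which are choices made for this example, and the helper names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the best-matching reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=2):
    """Geometric mean of the 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    c = len(candidate)
    # Reference length closest to the candidate length (ties -> shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


references = [
    "two men are conversing next to a small airplane".split(),
    "two men talking in front of a plane".split(),
]
candidate = "two men talking next to a plane".split()
print(round(bleu(candidate, references), 3))
```

Clipping is what stops a caption such as "the the the" from scoring well just by repeating a common word, and the brevity penalty discourages very short captions.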

Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets

A simple Baseline: NeuralTalk

A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk

Attention Mechanism: Show, Attend and Tell

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044

Modified Attention Mechanism: Know when to look

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887

Adaptive attention module: determines how to mix visual and linguistic information using a visual sentinel (softmax over the k feature-map vectors and 1 linguistic vector).
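
The mixing step can be illustrated with a small sketch. It assumes the attention scores for the k regions and for the sentinel have already been computed by the network. The function names and toy vectors below are hypothetical, and the scoring layers of the actual model are not reproduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, sentinel, region_scores, sentinel_score):
    """Mix k region vectors and 1 sentinel vector with a single softmax.

    Returns the mixed context vector and beta, the weight given to the
    sentinel (i.e. how much the model relies on linguistic information).
    """
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]
    dim = len(sentinel)
    context = [beta * sentinel[d] for d in range(dim)]
    for w, feat in zip(weights[:-1], region_feats):
        for d in range(dim):
            context[d] += w * feat[d]
    return context, beta


# Two hypothetical region vectors and one sentinel vector, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]
context, beta = adaptive_context(regions, sentinel, [2.0, 0.1], sentinel_score=-1.0)
print(context, beta)
```

When the sentinel score dominates, beta approaches 1 and the image regions contribute little to the context. That is the behaviour one would expect for words that can be predicted from language context alone.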

Concept-driven Image Captioning

Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002

Dense Captioning: localize and describe salient regions with natural sentences

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

Visual Q&A: answer an image-based question

Question: What color is the man's tie?
Answer: Brown

Dataset:

- DAQUAR
    - First dataset and benchmark released for the VQA task
    - Images are from the NYU Depth V2 dataset with semantic segmentations
    - 1449 images (795 training, 654 test), 12468 questions (auto-generated & human-annotated)
- COCO-QA
    - Automatically generated from image captions
    - 123287 images, 78736 train questions, 38948 test questions
    - 4 types of questions: object, number, color, location
    - Answers are all one-word
- VQA
    - Most widely-used VQA dataset
    - Two parts: one contains images from COCO, the other contains abstract scenes
    - 204,721 COCO and 50,000 abstract images with ~5.4 questions/im
- CLEVR
    - A diagnostic dataset for the reasoning ability of VQA models
    - Rendered images and automatically-generated questions with functional programs and scene graphs
    - 100,000 images (70,000 train & 15,000 val & 15,000 test) with ~10 questions/im
- Visual Genome
    - Densely-annotated dataset
    - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
    - 108,077 images with 1.7M grounded Q&A pairs
    - Not very clean, needs a little pre-processing

Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865

Simple Baseline Method

Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167

A Strong Baseline: Attention (1)

Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394

A Strong Baseline: Attention (2)

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162

Multiple glimpses

Co-Attention Mechanism for Image & Question

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Figure panels: Parallel Co-Attention; Alternating Co-Attention

Hierarchical Question Encoding

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Figure panels: Hierarchical Question Encoding Scheme; Encoding for Answer Prediction

Multimodal Fusion: Bilinear interaction modeling

MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676

Duality of Question Answering and Question Generation

Visual Question Generation as Dual Task of Visual Question Answering

Duality of Question Answering and Question Generation: Dual MUTAN

Visual Question Generation as Dual Task of Visual Question Answering

Learning to Reason: Compositional Network

Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526

End-to-End Training with policy gradient

Visual Relations: describe the image with object nodes and their interactions

Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700

Baseline: Visual Relationship Detection with Language Priors

Using word2vec as extra information for predicate recognition

Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/

Jointly detect objects and relations

Leverage the dependencies within the objects and their relationships as extra constraints

Triplet Proposal:
- Region proposal: RPN generates object proposals
- Triplet proposal: group the proposals and generate <subject-phrase-object> triplet proposals
- Triplet NMS: redundant proposal removal

Phrase Detection:
- Branch-based detection model
- ROI pooling helps different branches focus on different components
- Message passing structure (VPRS) helps different branches share information and consider the three components as a whole
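
The triplet proposal steps can be sketched as follows. This is a rough illustration under stated assumptions, not the ViP-CNN implementation. Boxes are (x1, y1, x2, y2) tuples, the phrase region is taken to be the union box of subject and object, the triplet score is the product of the two object scores, and two triplets are treated as redundant when the product of their subject IoU and object IoU exceeds a threshold. The scoring and overlap rules, the threshold value and all function names are assumptions made for this example.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def union_box(a, b):
    """Smallest box enclosing both a and b (used here as the phrase region)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def triplet_proposals(boxes, scores):
    """Group object proposals into ordered (subject, object) pairs."""
    triplets = []
    for i, j in permutations(range(len(boxes)), 2):
        triplets.append({
            "subject": boxes[i],
            "object": boxes[j],
            "phrase": union_box(boxes[i], boxes[j]),
            "score": scores[i] * scores[j],  # assumed scoring rule
        })
    return triplets


def triplet_nms(triplets, threshold=0.25):
    """Greedy NMS over triplets; the overlap rule is an assumption."""
    kept = []
    for cand in sorted(triplets, key=lambda t: t["score"], reverse=True):
        redundant = any(
            iou(cand["subject"], k["subject"]) * iou(cand["object"], k["object"]) > threshold
            for k in kept
        )
        if not redundant:
            kept.append(cand)
    return kept


# Three hypothetical object proposals; the first two overlap heavily.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
proposals = triplet_proposals(boxes, scores)
print(len(proposals), len(triplet_nms(proposals)))
```

With n object proposals there are n(n-1) ordered pairs, so the number of candidate triplets grows quadratically. That growth is why a redundancy-removal step such as triplet NMS is needed before phrase detection.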

ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/

Relations as an intermediate level of Objects and Region Captions

Leverage the dependencies within the objects and their relationships as extra constraints

Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/

Emerging Topics: Human-Object Interaction

Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333