Deep Learning in Computer Vision
Yikang Li
MMLab, The Chinese University of Hong Kong
Sep 22nd, 2017 @ Microsoft Research Asia, China
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Introduction - DL in the press
[1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/
[2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html
Introduction - DL in the press
CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors)
http://cvpr2017.thecvf.com/
Introduction - Investment in AI
http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754
Renowned Researchers/Groups
- Trevor Darrell, BAIR, UC Berkeley
- Recognition, detection
- Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN)
- Fei-Fei Li, Stanford University
- ImageNet, emerging topics
- Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI)
- Antonio Torralba, CSAIL, MIT
- Scene understanding, multimodality-based Computer Vision
- Facebook Artificial Intelligence Research (FAIR)
- DeepMind, Google Brain, Google Research
- Microsoft Research
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Roadmap of Deep Learning - Depth
Basic Block - Convolution
Convolution operation: f(x) = Wx + b, where the output f(x) is called a feature map. W is the shared weight (the kernel/filter of the layer); the same weights are reused at every spatial location.
Convolution is applied in a sliding-window fashion, which saves parameters and yields translation invariance, a property that is very important for vision tasks.
A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.
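A minimal NumPy sketch of this sliding-window operation (single channel, stride 1, no padding; all names and sizes are illustrative):

import numpy as np

def conv2d(x, W, b):
    """Slide one kernel W over input x; the same weights are reused
    at every location, which is what gives translation invariance."""
    kh, kw = W.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * W) + b
    return out  # the feature map: f(x) = Wx + b at each window

x = np.random.randn(8, 8)          # toy single-channel input
W = np.random.randn(3, 3)          # shared 3x3 kernel
print(conv2d(x, W, 0.0).shape)     # (6, 6)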
Roadmap of Deep Learning - Network Structure (cont’d)
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
What is object detection?
http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf
Why object detection? It is a fundamental task in vision.
+ Detection in general classes
+ Face detection, crowd analysis
+ Car/signal detection
R-CNN -> Fast R-CNN -> Faster R-CNN
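The jump from R-CNN to Fast R-CNN hinges on RoI pooling: run the CNN once over the whole image, then crop each region proposal from the shared feature map. A minimal sketch using torchvision's roi_pool (the feature map, boxes, and stride are toy values, not the talk's actual setup):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)      # shared conv feature map (1 image)
# proposals in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0, 10., 10., 200., 200.],
                     [0, 50., 80., 300., 400.]])
# pool every RoI to a fixed 7x7 grid; spatial_scale maps image coords
# to feature-map coords (stride 16 here, as in a VGG-style backbone)
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1/16)
print(pooled.shape)                          # torch.Size([2, 256, 7, 7])

Faster R-CNN then replaces the external proposal step with a Region Proposal Network that shares these same conv features.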
Detection Results
Back to the General Picture: Deep Learning for Computer Vision
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Image Captioning
Describe an image with a natural sentence
❏ Two gentleman talking in front of propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.
Datasets:
- PASCAL Sentence Dataset
  - 1,000 images & 5 sentences/image
  - Images originally collected for image classification, object detection and segmentation
  - No filtering: complex scenes, varied scales and viewpoints of different objects
- Flickr 8K
  - 8,108 images & 5 sentences/image
  - Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
- Flickr 30K
  - Extension of Flickr 8K
- MS COCO
  - Largest captioning dataset
  - Includes captions & object annotations
  - 328,000 images & 5 sentences/image
- Visual Genome
  - Densely-annotated dataset
  - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
  - 108,077 images with full annotations
  - Not very clean; needs a little pre-processing
Metrics:
- BLEU, METEOR, ROUGE, CIDEr, human evaluation (a BLEU sketch follows below)
Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
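As a concrete example of these metrics, BLEU scores a candidate caption by its n-gram overlap with the reference sentences. A small sketch using NLTK (the sentences are toy examples in the spirit of the captions above):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "two men are conversing next to a small airplane".split(),
    "two men talking in front of a plane".split(),
]
candidate = "two men talk near a small plane".split()

# BLEU-4 with smoothing so short sentences don't zero out
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))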
A simple Baseline: NeuralTalk
A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
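NeuralTalk's recipe is a CNN encoder feeding an RNN language model that predicts the caption word by word. A minimal PyTorch sketch of that pattern (dimensions and class names are illustrative, not Karpathy's actual code):

import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # CNN feature -> embed space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # condition the LSTM on the image by prepending the projected feature
        img_tok = self.img_proj(img_feat).unsqueeze(1)          # (B, 1, E)
        words = self.embed(captions)                            # (B, T, E)
        h, _ = self.rnn(torch.cat([img_tok, words], dim=1))
        return self.out(h)                                      # next-word logits

model = CaptionRNN(vocab_size=10000)
logits = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)   # torch.Size([2, 13, 10000])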
Attention Mechanism: Show, Attend and Tell
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
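The soft attention step in Show, Attend and Tell scores each of the k spatial feature vectors against the LSTM hidden state and feeds their weighted sum back as context. A hedged sketch of just that step (dimensions illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat = nn.Linear(feat_dim, attn_dim)
        self.hid = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, V, h):
        # V: (B, k, feat_dim) spatial features; h: (B, hidden_dim) LSTM state
        e = self.score(torch.tanh(self.feat(V) + self.hid(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)          # (B, k) attention map
        context = (alpha.unsqueeze(-1) * V).sum(dim=1)   # weighted sum z_t
        return context, alpha

attn = SoftAttention(feat_dim=512, hidden_dim=512)
ctx, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 512))
print(ctx.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 196])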
Modified Attention Mechanism: Know when to look
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887
Adaptive Attention module
Determines how to mix visual and linguistic information via a visual sentinel (a softmax over the k feature-map vectors plus 1 linguistic sentinel vector).
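A hedged sketch of that mixing step: append the sentinel s_t to the k spatial vectors, take one softmax over all k+1 scores, and the sentinel's weight acts as a gate toward purely linguistic generation (shapes illustrative):

import torch
import torch.nn.functional as F

def adaptive_context(V, s, scores):
    # V: (B, k, D) visual vectors; s: (B, D) visual sentinel;
    # scores: (B, k+1) unnormalized attention over [V; s]
    alpha = F.softmax(scores, dim=1)                 # one softmax over k+1 slots
    beta = alpha[:, -1:]                             # sentinel gate in [0, 1]
    visual = (alpha[:, :-1].unsqueeze(-1) * V).sum(1)
    return visual + beta * s                         # adaptive context c_t

c = adaptive_context(torch.randn(2, 49, 512), torch.randn(2, 512),
                     torch.randn(2, 50))
print(c.shape)   # torch.Size([2, 512])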
Concept-driven Image Captioning
Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002
Dense Captioning
Localize and describe salient regions with natural sentences
DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Visual Q&A
Answer an image-based question
Question: What color is the man's tie?
Answer: Brown
Datasets:
- DAQUAR
  - First dataset and benchmark released for the VQA task
  - Images are from the NYU Depth V2 dataset with semantic segmentations
  - 1,449 images (795 training, 654 test), 12,468 questions (auto-generated & human-annotated)
- COCO-QA
  - Automatically generated from image captions
  - 123,287 images; 78,736 train questions, 38,948 test questions
  - 4 types of questions: object, number, color, location
  - Answers are all one word
- VQA
  - Most widely-used VQA dataset
  - Two parts: one contains images from COCO, the other contains abstract scenes
  - 204,721 COCO images and 50,000 abstract images with ~5.4 questions/image
- CLEVR
  - A diagnostic dataset for the reasoning ability of VQA models
  - Rendered images and automatically-generated questions with functional programs and scene graphs
  - 100,000 images (70,000 train, 15,000 val, 15,000 test) with ~10 questions/image
- Visual Genome
  - Densely-annotated dataset
  - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes
  - 108,077 images with 1.7M grounded Q&A pairs
  - Not very clean; needs a little pre-processing
Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865
Simple Baseline Method
Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
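The baseline is strikingly simple: concatenate a bag-of-words encoding of the question with the CNN image feature and train a single softmax classifier over answers. A hedged sketch of that pattern (sizes and names illustrative):

import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=4096, embed_dim=300):
        super().__init__()
        self.word_embed = nn.EmbeddingBag(vocab_size, embed_dim)  # bag of words
        self.classifier = nn.Linear(embed_dim + img_dim, num_answers)

    def forward(self, question_ids, img_feat):
        q = self.word_embed(question_ids)          # pooled word embeddings
        return self.classifier(torch.cat([q, img_feat], dim=1))

model = BowImgBaseline(vocab_size=10000, num_answers=1000)
logits = model(torch.randint(0, 10000, (2, 8)), torch.randn(2, 4096))
print(logits.shape)   # torch.Size([2, 1000])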
A Strong Baseline: Attention (1)
Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394
A Strong Baseline: Attention (2)
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162
Multiple glimpses
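A hedged sketch of the multiple-glimpse idea used by such attention baselines: compute G attention maps over the same spatial features and concatenate the G weighted sums (all shapes and names illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

def glimpses(V, q, proj, G=2):
    # V: (B, k, D) image features; q: (B, D) question encoding;
    # proj: nn.Linear(2*D, G) scoring each location for each glimpse
    B, k, D = V.shape
    fused = torch.cat([V, q.unsqueeze(1).expand(B, k, D)], dim=-1)
    alpha = F.softmax(proj(fused), dim=1)              # (B, k, G)
    out = torch.einsum('bkg,bkd->bgd', alpha, V)       # one context per glimpse
    return out.reshape(B, G * D)                       # concatenated glimpses

proj = nn.Linear(1024, 2)
z = glimpses(torch.randn(3, 196, 512), torch.randn(3, 512), proj)
print(z.shape)   # torch.Size([3, 1024])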
Co-Attention Mechanism for Image & Question
Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
Parallel Co-Attention vs. Alternating Co-Attention
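One simplified reading of the parallel variant: build an affinity matrix between every question word and every image location, then attend in both directions (a sketch with illustrative shapes, not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def parallel_coattention(Q, V, Wb):
    # Q: (B, T, D) word features; V: (B, N, D) image features;
    # Wb: nn.Linear(D, D) bilinear affinity weights
    C = torch.tanh(torch.bmm(Q, Wb(V).transpose(1, 2)))   # (B, T, N) affinity
    a_v = F.softmax(C.max(dim=1).values, dim=1)           # attention over regions
    a_q = F.softmax(C.max(dim=2).values, dim=1)           # attention over words
    v_hat = torch.bmm(a_v.unsqueeze(1), V).squeeze(1)     # attended image vector
    q_hat = torch.bmm(a_q.unsqueeze(1), Q).squeeze(1)     # attended question vector
    return q_hat, v_hat

Wb = nn.Linear(512, 512)
q_hat, v_hat = parallel_coattention(torch.randn(2, 12, 512),
                                    torch.randn(2, 196, 512), Wb)
print(q_hat.shape, v_hat.shape)   # torch.Size([2, 512]) torch.Size([2, 512])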
Hierarchical Question Encoding
Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
Hierarchical question encoding scheme; encoding for answer prediction
Multimodal Fusion: Bilinear interaction modeling
MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
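A simplified Tucker-style fusion in the spirit of MUTAN, not the paper's exact parametrization: project both modalities to small ranks, interact them through a learned core tensor, then project to answer logits (ranks and sizes illustrative):

import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    def __init__(self, q_dim, v_dim, out_dim, rank_q=160, rank_v=160, rank_o=160):
        super().__init__()
        self.Wq = nn.Linear(q_dim, rank_q)
        self.Wv = nn.Linear(v_dim, rank_v)
        self.core = nn.Parameter(torch.randn(rank_q, rank_v, rank_o) * 0.01)
        self.Wo = nn.Linear(rank_o, out_dim)

    def forward(self, q, v):
        tq, tv = self.Wq(q), self.Wv(v)                    # factor projections
        # contract the core tensor with both modality factors
        z = torch.einsum('bi,ijk,bj->bk', tq, self.core, tv)
        return self.Wo(z)                                  # fused prediction

fuse = TuckerFusion(q_dim=2400, v_dim=2048, out_dim=2000)
print(fuse(torch.randn(2, 2400), torch.randn(2, 2048)).shape)  # torch.Size([2, 2000])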
Duality of Question Answering and Question Generation
Visual Question Generation as Dual Task of Visual Question Answering
Duality of Question Answering and Question Generation: Dual MUTAN
Visual Question Generation as Dual Task of Visual Question Answering
Learning to Reason: Compositional Network
Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526
End-to-End Training with policy gradient
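A hedged sketch of one policy-gradient step for such a layout policy: sample a discrete module layout, treat the downstream answer quality as reward, and apply REINFORCE through the log-probability of the sampled choice (all values are toy stand-ins):

import torch

logits = torch.randn(4, 10, requires_grad=True)    # policy scores over 10 layouts
dist = torch.distributions.Categorical(logits=logits)
layout = dist.sample()                             # non-differentiable discrete choice
# In the real model the reward would come from executing the sampled
# module layout and scoring the predicted answer; here it is random.
reward = torch.rand(4)
baseline = reward.mean()                           # simple variance-reducing baseline
loss = -((reward - baseline) * dist.log_prob(layout)).mean()
loss.backward()                                    # gradient flows through log-prob
print(logits.grad.shape)                           # torch.Size([4, 10])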
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
Visual Relations
Describe the image with object nodes and their interactions
Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700
Baseline: Visual Relationship Detection with Language Priors
Uses word2vec as extra information for predicate recognition
Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
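A hedged sketch of combining the two cues: a visual predicate score from the union-box CNN feature, and a language score from the concatenated word vectors of subject and object. The combination rule and all names/sizes are illustrative, not the paper's exact formulation:

import torch
import torch.nn as nn

class PredicateScorer(nn.Module):
    def __init__(self, num_predicates, vis_dim=4096, w2v_dim=300):
        super().__init__()
        self.visual = nn.Linear(vis_dim, num_predicates)
        # language module: word2vec(subject) ++ word2vec(object) -> score
        self.language = nn.Linear(2 * w2v_dim, num_predicates)

    def forward(self, union_feat, subj_vec, obj_vec):
        v = self.visual(union_feat)                            # visual evidence
        l = self.language(torch.cat([subj_vec, obj_vec], 1))   # language prior
        return v * torch.sigmoid(l)   # hypothetical gating of visual by language

scorer = PredicateScorer(num_predicates=70)
s = scorer(torch.randn(2, 4096), torch.randn(2, 300), torch.randn(2, 300))
print(s.shape)   # torch.Size([2, 70])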
Jointly detect objects and relations
Leverage the dependencies between objects and their relationships as extra constraints
Triplet proposal:
- Region proposal: an RPN generates object proposals
- Triplet proposal: group the proposals into <subject-phrase-object> triplet proposals
- Triplet NMS: redundant proposal removal (a sketch follows the citation below)
Phrase detection:
- Branch-based detection model
- RoI pooling helps each branch focus on a different component
- A message-passing structure (VPRS) lets the branches share information and treat the three components as a whole
ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/
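A hedged sketch of triplet NMS: suppress a <subject, object> proposal pair when both of its boxes heavily overlap a higher-scored kept pair. The suppression criterion, threshold, and data are illustrative:

import torch
from torchvision.ops import box_iou

def triplet_nms(subj_boxes, obj_boxes, scores, iou_thresh=0.5):
    order = scores.argsort(descending=True)
    iou_s = box_iou(subj_boxes, subj_boxes)   # pairwise subject IoU
    iou_o = box_iou(obj_boxes, obj_boxes)     # pairwise object IoU
    keep = []
    for i in order.tolist():
        # suppress only if BOTH boxes overlap an already-kept triplet
        if all(min(iou_s[i, j], iou_o[i, j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

subj = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]])
obj = torch.tensor([[20., 20., 30., 30.], [21., 21., 31., 31.]])
print(triplet_nms(subj, obj, torch.tensor([0.9, 0.8])))   # [0]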
Relations as an intermediate level between Objects and Region Captions
Leverage the dependencies between objects and their relationships as extra constraints
Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/
Emerging Topics: Human-Object Interaction
Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333