Fully Convolutional Networks for Semantic Segmentation
By Jonathan Long*, Evan Shelhamer*, Trevor Darrell
Instance-sensitive Fully Convolutional Networks
By Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun
Presented by Zilong [email protected]
Outline
1. What problems do they attempt to solve?
2. Key Contributions
3. Network Architecture Details
4. Experimental Setup and Results
5. Strengths and Weaknesses*
6. Possible Extensions*
a. And other comments
UC Berkeley
Fully Convolutional Networks for Semantic Segmentation
Jonathan Long* Evan Shelhamer* Trevor Darrell
Content Source: https://docs.google.com/presentation/d/1VeWFMpZ8XN7OC3URZP4WdXvOGYckoFWGVN7hApoXVnc/edit#slide=id.g529579d43_3_7
Problem to solve: Image segmentation. Pixels in, pixels out.
Semantic segmentation
Monocular depth estimation Eigen & Fergus 2015
Boundary prediction Xie & Tu 2015
Optical flow Fischer et al. 2015
Problem to solve
What is semantic segmentation?
Input: Image (2D array of pixels)
Output: Pixels clustered according to their semantic categories.
I.e. Class-level pixel-wise clustering (supervised)
NOTE: pixels of two people in the same image will be clustered together by this model; instances are not separated. The second paper attempts to fill in this gap.
Input Output
Key Contributions
1) AlexNet (VGG, GoogLeNet) -> Fully Convolutional Network
a) From image-level classification to pixel-level clustering
b) Arbitrarily sized input images*
c) End-to-end learning model
2) Skip-layer structure to improve segmentation detail
a) Combines deep, coarse, semantic information with shallow, fine, appearance information.
b) WHAT (deeper layers) + WHERE (shallower layers)
Convnets perform classification, end-to-end: image in, a 1000-dim vector of class scores (“tabby cat”) out, in < 1 millisecond.
Recall: a classification network
NOTE: Implementing layers 6 and 7 as fully connected layers fixes the input image size.
Recall: R-CNN
Object detection without modifying the AlexNet architecture
figure: Girshick et al.
Recall: R-CNN does detection
Detection takes many seconds per image (example outputs: “cat”, “dog”).
Whether using off-the-shelf methods or in-network layers for region proposals, bounding boxes are always needed in these approaches.
SLOW
~1/10 second
end-to-end learning
???
“tabby cat”
A classification network (see it again)
How to become fully convolutional
To be honest, “fully convolutional” is just another way of thinking…
Becoming fully convolutional
But it makes a significant difference in training and in maintaining the network structure in implementation!
- Only convolution kernels are maintained; downsampling ratios are controlled by strides.
- Arbitrary input size.
- Faster, compared to a naive implementation.
Layer 6 can be generated with a kernel of size 13 x 13 x d_5: a kernel as large as its entire input feature map, so (at the original input size) it is applied at a single position and does not move around.
Layer 7 can be generated with a kernel of size 1 x 1 x d_6 on top of that single position.
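The equivalence behind this conversion can be sketched in a few lines of NumPy. This is a toy illustration under assumed shapes (13 x 13 spatial size, made-up channel counts), not the paper's exact AlexNet dimensions: a fully connected layer over a feature map equals one application of a convolution kernel the size of that map.

```python
import numpy as np

# Toy shapes (assumptions, not AlexNet's real dims).
rng = np.random.default_rng(0)
C, K = 4, 6                                  # input channels, output units
feat = rng.standard_normal((13, 13, C))      # the layer-5 feature map
W = rng.standard_normal((K, 13 * 13 * C))    # fully connected weight matrix

fc_out = W @ feat.reshape(-1)                # ordinary fully connected layer

# The same weights viewed as K conv kernels of shape 13 x 13 x C,
# applied to a map exactly their own size: one output position each.
kernels = W.reshape(K, 13, 13, C)
conv_out = np.array([(k * feat).sum() for k in kernels])

assert np.allclose(fc_out, conv_out)         # identical results
```

On a larger input, the same kernels would simply slide, producing a spatial grid of outputs instead of a single vector, which is what makes arbitrary input sizes possible.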
Now it is fully convolutional
Upsampling output
NOTE: Upsampled output is H x W x (class number + 1)
Each H x W slice shows the heat map for one category
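A minimal sketch of reading that output, with toy sizes (the shapes here are assumptions for illustration): each of the (class number + 1) channels is a per-category heat map, and taking the per-pixel argmax over channels yields the final label map.

```python
import numpy as np

# Hypothetical toy sizes; channel 0 is assumed to be background.
rng = np.random.default_rng(0)
H, W, C = 4, 5, 3
scores = rng.standard_normal((H, W, C + 1))   # upsampled H x W x (C+1) output

heatmap_cat1 = scores[:, :, 1]                # H x W heat map for category 1
label_map = scores.argmax(axis=-1)            # one class index per pixel

assert heatmap_cat1.shape == (H, W)
assert label_map.shape == (H, W)
```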
End-to-end & pixels-to-pixels network
Each semantic segmentation ground truth image actually needs to be divided into (class number + 1) slices and each slice corresponds to the ground truth heat map of one category.
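The division described above can be sketched as a one-hot expansion of the label map (toy sizes; the label values here are made up for illustration):

```python
import numpy as np

# Hypothetical 3 x 4 ground-truth label map with C = 2 foreground classes;
# 0 is assumed to be background.
H, W, C = 3, 4, 2
gt = np.array([[0, 1, 1, 0],
               [0, 2, 2, 0],
               [0, 0, 2, 0]])

# One binary slice per category: slice c marks the pixels of class c.
slices = np.stack([(gt == c).astype(np.float32) for c in range(C + 1)], axis=-1)

assert slices.shape == (H, W, C + 1)
assert slices.sum(axis=-1).min() == 1.0   # every pixel is in exactly one slice
```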
conv, pool, nonlinearity → upsampling → pixelwise output + loss
End-to-end, pixels-to-pixels network
stride 32
no skips
input image
If stopped right here, what could we get?
19
Coarse. Really, really coarse.
Spectrum of deep features
Combine where (local, shallow) with what (global, deep)
fuse features into deep jet
(cf. Hariharan et al. CVPR15 “hypercolumn”)
Skip layers
skip to fuse layers!
Interp + sum
Interp + sum
dense output
End-to-end, joint learningof semantics and location
Skip layers
Content Source: https://computing.ece.vt.edu/~f15ece6504/slides/L13_FCN.pdf
Skip layers
How exactly are layers fused?
Take FCN-16s for instance: pool4 and conv7 are fused in the following steps:
1. Add a 1 x 1 convolution layer on top of pool4 to produce additional class predictions.
a. The output predictions of pool4 are at stride 16.
2. 2x upsample the output of conv7, which is at stride 32.
a. The upsampled conv7 predictions are at stride 16 as well.
3. Add these stride-16 predictions together.
4. Upsample the summed stride-16 predictions back to image size.
NOTE: ALL the weights can be learned. The upsampling weights can be initialized with bilinear interpolation.
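The steps above can be sketched with toy arrays. This is a rough NumPy illustration under assumed shapes; the 1 x 1 conv is a random projection here, and nearest-neighbour repetition stands in for the learned, bilinear-initialized deconvolution:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 21                                         # classes incl. background
pool4 = rng.standard_normal((28, 28, 512))     # assumed stride-16 features
conv7_scores = rng.standard_normal((14, 14, C))  # stride-32 predictions

# Step 1: 1x1 convolution on pool4 -> per-class scores at stride 16.
W = rng.standard_normal((512, C)) * 0.01
score_pool4 = pool4 @ W                        # (28, 28, C)

# Step 2: 2x upsample the stride-32 scores to stride 16.
up_conv7 = conv7_scores.repeat(2, axis=0).repeat(2, axis=1)  # (28, 28, C)

# Step 3: sum the two stride-16 prediction streams.
fused = score_pool4 + up_conv7
assert fused.shape == (28, 28, C)

# Step 4 would upsample `fused` 16x back to the input resolution.
```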
stride 32
no skips
stride 16
1 skip
stride 8
2 skips
ground truth
input image
Skip layer refinement
Training + Testing
- Train on whole images at a time, without patch sampling
- Reshape the network to take input of any size
- Forward time is ~100 ms for a 500 x 500 x 21 output (this is really fast!)
Qualitative Results
FCN SDS* Truth Input
Relative to prior state-of-the-art SDS:
- 30% relative improvement for mean IoU
- 286× faster
*Simultaneous Detection and Segmentation Hariharan et al. ECCV14
Ghosts sitting on that boat?!!
Experimental Setup
1) AlexNet architecture
2) VGG nets: pick the VGG 16-layer net
3) GoogLeNet: use only the final loss layer, and improve performance by discarding the final average pooling layer.
*Decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions.
Quantitative Results
Quantitative Results: SIFT Flow, NYUDv2
PASCAL VOC 2011 (8498 training images)
Potential Extensions
A boring extension: if we directly use shallower layers and upsample without fusing with deeper layers, how bad would it be?
An interesting, promising and intuitive extension: what the next paper attempts to address =>
Instance-sensitive Fully Convolutional Networks
Jifeng Dai, Kaiming He, Jian Sun. Microsoft Research
Yi Li. Tsinghua University (while interning at Microsoft Research)
Shaoqing Ren. University of Science and Technology of China (while interning at Microsoft Research)
Problem to solve: Instance-level segmentation. Pixels in, pixels out.
Major Contributions
A fully convolutional network architecture that:
1) Computes a set of instance-sensitive score maps
a) Each pixel is a classifier of relative positions to an object instance
b) Assembled to output an instance candidate at each position
2) Reuses semantic segmentation results
3) Exploits image local coherence
a) w/o any high-dimensional layer related to the mask resolution (compare with DeepMask)
Major Contributions
A fully convolutional network architecture for instance-level segmentation.
Recall: Upsampling output
NOTE: Upsampled output is H x W x (class number + 1)
Each H x W slice shows the heat map for one category
> Generate instance-sensitive score maps > Assemble
Generate a set of k x k instance-sensitive score maps (for instance, k = 3)
[figure: score maps #1 through #9 arranged in a 3 x 3 grid; an m x m sliding window assembles its k x k cells from the corresponding maps, producing an m x m x (k x k) output]
NOTE: Not all positions the sliding window visited were objects.
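The assembling step can be sketched as follows. This is a toy NumPy illustration (sizes, indexing layout, and the `assemble` helper are assumptions for exposition, not the paper's implementation): an m x m window is split into a k x k grid of cells, and cell (i, j) copies its scores from score map (i, j) at that cell's image location.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 3, 6                        # grid size, window size (m divisible by k)
H, W = 12, 12                      # toy image size
score_maps = rng.standard_normal((k * k, H, W))   # k*k instance-sensitive maps

def assemble(y, x):
    """Assemble instance mask scores for the m x m window at (y, x)."""
    cell = m // k
    out = np.empty((m, m))
    for i in range(k):
        for j in range(k):
            idx = i * k + j        # score map serving relative position (i, j)
            ys, xs = y + i * cell, x + j * cell
            out[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                score_maps[idx, ys:ys+cell, xs:xs+cell]
    return out

mask_scores = assemble(2, 3)       # one instance candidate per window position
assert mask_scores.shape == (m, m)
```

Because each map answers a fixed relative-position question ("am I the top-left of an instance?"), nearby windows reuse the same underlying map values, which is the local coherence the paper exploits.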
Complete Instance-level Segmentation Network: 2 Branches
Upper: generate instance-sensitive score maps and assemble
Bottom: generate objectness scores
Experimental Setup
1) Use the VGG-16 network pre-trained on ImageNet as the feature extractor.
2) The 13 convolutional layers in VGG-16 are applied fully convolutionally on an input image of arbitrary size.
3) Reduce the network stride and increase feature map resolution:
a) the max pooling layer pool4 (between conv4_3 and conv5_1) is modified to have a stride of 1 instead of 2,
b) accordingly, the filters in conv5_1 to conv5_3 are adjusted by the “hole algorithm” (dilated convolution).
*Using this modified VGG network, the effective stride of the conv5_3 feature map is s = 8 pixels w.r.t. the input image.
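The “hole algorithm” can be illustrated in 1-D (a toy sketch with made-up sizes, not the VGG modification itself): when a pooling stride shrinks from 2 to 1, the following filters get holes (dilation 2) so each weight still sees the same input locations as before.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Valid' 1-D convolution of x with w, with holes between filter taps."""
    span = (len(w) - 1) * dilation + 1          # receptive field of the filter
    return np.array([
        sum(w[t] * x[i + t * dilation] for t in range(len(w)))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10, dtype=float)                  # a linear ramp signal
w = np.array([1.0, -2.0, 1.0])                  # second-difference filter
out = dilated_conv1d(x, w, dilation=2)          # taps at offsets 0, 2, 4

assert len(out) == 6                            # 10 - span(5) + 1 outputs
assert np.allclose(out, 0.0)                    # 2nd difference of a ramp is 0
```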
DeepMask
Looks similar, but it doesn’t know how to use the local coherence.
Quantitative Results
Ablation comparisons on the PASCAL VOC 2012 validation set
Quantitative Results
Performance evaluations on the PASCAL VOC 2012 validation set
Quantitative Results
Performance evaluations on the MS COCO validation set
Qualitative Result
Qualitative Result
Strengths and Weaknesses
Strengths:
1) Both papers addressed very important questions with fully convolutional networks, efficiently.
2) Both papers have novelty with respect to network architectures.
3) Both papers have convincing experiments.
a) Visualization and numerical results are clear and convincing.
4) The discussion of convolution operations in the first paper is helpful for interpretation and better understanding of convolutional networks.
5) The second paper doesn’t require a separate process to generate region proposals.
Weaknesses:
1) How the training data is used is never clearly addressed.
a) What ground truth is used together with the forwarded heat maps in the loss functions?
i) The first paper is intuitive in this part, but the second paper is very confusing.
2) Several essential points are unclear in the second paper:
a) Did the second paper use skip layers?
b) Where did the second paper upsample? Or did it not upsample at all?
3) The relative-location grids in the second paper worked well but look strange:
a) One person’s “left” could be another’s “right”, yet each channel is in charge of the same relative location for all sliding windows.
Potential directions
1) Other tasks to be resolved by fully convolutional networks
a) Scene recognition?
(1) Semantic combination of objects
2) Why is the size of the sliding windows fixed in the second paper?
a) Many small instances can crowd together.
3) What about combining box-level object recognition with semantic segmentation?
Image Source: https://www.pinterest.com/pin/369787819374178444/https://www.pinterest.com/pin/399553798160612769/
Backup Slides
Datasets:
+ NYUD net for multi-modal input and SIFT Flow net for multi-task output
PASCAL VOC: Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [17], and the well-known R-CNN [12].
NYUDv2 [33] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40-class semantic segmentation task by Gupta et al. [14].
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”), as well as three geometric categories (“horizontal”, “vertical”, and “sky”).
Past and future history of fully convolutional networks
history
Convolutional Locator Network, Wolf & Platt 1994
Shape Displacement Network, Matan & LeCun 1992
Scale Pyramid, Burt & Adelson ‘83
pyramids
The scale pyramid is a classic multi-resolution representation.
Fusing multi-resolution network layers is a learned, nonlinear counterpart.
Jet, Koenderink & Van Doorn ‘87
jets
The local jet collects the partial derivatives at a point for a rich local description.
The deep jet collects layer compositions for a rich,learned description.
extensions
- more tasks
- random fields
- weak supervision
many pixelwise tasks
semantic segmentation
monocular depth estimation Eigen & Fergus 2015
boundary prediction Xie & Tu 2015
optical flow Fischer et al. 2015
fully conv. nets + random fields
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Chen* & Papandreou* et al. ICLR 2015.
fully conv. nets + random fields
Conditional Random Fields as Recurrent Neural Networks. Zheng* & Jayasumana* et al. arXiv 2015.
[ comparison credit: CRF as RNN, Zheng* & Jayasumana* et al. ICCV 2015 ]
DeepLab: Chen* & Papandreou* et al. ICLR 2015. CRF-RNN: Zheng* & Jayasumana* et al. ICCV 2015.
fully conv. nets + weak supervision
Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Pathak et al. arXiv 2015.
FCNs expose a spatial loss map to guide learning: segment from tags by MIL or pixelwise constraints.
fully conv. nets + weak supervision
BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. Dai et al. 2015.
FCNs expose a spatial loss map to guide learning: mine boxes + feedback to refine masks.
leaderboard
== segmentation with Caffe
[leaderboard figure: the PASCAL VOC segmentation leaderboard, with FCN-based entries filling nearly every top rank]
caffeinated contemporaries
Hypercolumn SDS: Hariharan, Arbeláez, Girshick, Malik
Zoom-Out: Mostajabi, Yadollahpour, Shaknarovich
Convolutional Feature Masking: Dai, He, Sun
fcn.berkeleyvision.org
conclusion
fully convolutional networks are fast, end-to-end models for pixelwise problems
- code in Caffe master branch
- models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context
caffe.berkeleyvision.org
github.com/BVLC/caffe
model example, inference example, solving example