![Page 1: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/1.jpg)
Spatial Pyramid Pooling in Deep Convolutional
Networks for Visual Recognition
Presenter ByungIn Yoo
CS688/WST665
![Page 2: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/2.jpg)
Contents
● Introduction
● Motivation
● Previous work
● Main Idea
● Details
● Experiments
● Conclusion
![Page 3: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/3.jpg)
Introduction
● Web-scale image retrieval
● Classify images or videos
● Detect and localize objects
● Estimate semantic and geometrical attributes
● Why is this challenging?
  - Viewpoint
  - Illumination
  - Occlusion
  - Scale
  - Deformation
  - Background clutter
![Page 4: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/4.jpg)
Motivation
● Current CNNs require a fixed input image size (e.g., 224 x 224).
● Images are cropped or warped to 224x224 before entering the Convolutional Neural Network (CNN):
  - Crop: content loss
  - Warp: distortion
● Recognition accuracy is degraded!
![Page 5: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/5.jpg)
Motivation
Spatial Pyramid Pooling
![Page 6: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/6.jpg)
Previous work (1/2)
● Spatial Pyramid Matching
- very successful in traditional computer vision
Grauman et al., The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.
Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR 2006.
![Page 7: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/7.jpg)
Previous work (2/2)
● Zeiler-Fergus Architecture (2013, 1st): 8 layers
  - Still low accuracy! & Fixed image size
● GoogLeNet (2014, 1st): 22 layers
  - Too complex model! & Fixed image size
Legend: Convolution, Pooling, Softmax, Other
M.D. Zeiler et al., “Visualizing and understanding convolutional neural networks”, arXiv:1311.2901, 2013.
Christian Szegedy et al., “Going Deeper with Convolutions”, arXiv:1409.4842, 2014.
![Page 8: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/8.jpg)
Main Idea (1/2)
● Add Spatial Pyramid Pooling layer!
SPP-net
Previous nets
![Page 9: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/9.jpg)
Main Idea (2/2)
● Generate fixed length representation regardless of image size/scale.
● Simple (still 8 layers) and Powerful Model!
● Variable input size/scale
  - Multi-size training, multi-scale testing, full image view
● Multi-level pooling
  - Robust to deformation
● Operated on feature map
  - Pooling in regions
![Page 10: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/10.jpg)
Details – Convolutional Layers and Feature Maps
● Convolutional layers can inherently accept images of arbitrary size.
● Feature maps encode not only the strength of the responses, but also their spatial positions.
![Page 11: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/11.jpg)
Details – The Spatial Pyramid Pooling Layer
● SPP-net is a CNN with a new Spatial Pyramid Pooling (SPP) layer
Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax
● Conv5: 256 filters
● SPP output: 256 x (4x4 + 2x2 + 1) = 5376-dimensional vector
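To make the fixed-length claim concrete, here is a minimal pure-Python sketch of pyramid max pooling. It is an illustration under stated assumptions, not the authors' implementation: the function names `spp_max_pool` and `make_maps`, and the proportional bin boundaries (floor for the start, ceil for the end), are choices made for this example. It shows that 256 feature maps of any spatial size reduce to 256 x (4x4 + 2x2 + 1) = 5376 values.

```python
import math


def spp_max_pool(feature_maps, levels=(4, 2, 1)):
    """Max-pool every channel into n x n bins per level and concatenate.

    feature_maps: list of channels, each a list of rows (H x W numbers).
    Returns a flat list of length channels * sum(n * n for n in levels).
    """
    pooled = []
    for n in levels:
        for channel in feature_maps:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                # Assumed bin rule: boundaries proportional to the map size.
                r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                    pooled.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return pooled


def make_maps(channels, h, w):
    """Hypothetical helper: deterministic dummy feature maps."""
    return [[[float((k + r * w + c) % 7) for c in range(w)]
             for r in range(h)] for k in range(channels)]


# Two different spatial sizes give descriptors of the same length.
small = spp_max_pool(make_maps(256, 10, 10))
large = spp_max_pool(make_maps(256, 13, 17))
print(len(small), len(large))
```

Because the number of bins is fixed and only the bin size changes with the input, the fully connected layers after SPP always see the same input length.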
![Page 12: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/12.jpg)
Details – Training with the Spatial Pyramid Pooling
● Single-size training
  - Simply modify the configuration file of CNN frameworks
Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax
Conv5 feature map: 13x13
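For a fixed input size, each pyramid level can be expressed as an ordinary sliding-window pooling layer. The sketch below assumes the rule window = ceil(a / n) and stride = floor(a / n) for an a x a feature map and an n x n level; the function names are hypothetical, and the rule reproduces the intended bin count for the sizes used on these slides rather than for every possible size.

```python
import math


def pooling_config(feature_size, bins):
    """Return (window, stride) for an n x n pyramid level on an a x a map."""
    window = math.ceil(feature_size / bins)
    stride = feature_size // bins
    return window, stride


def output_bins(feature_size, window, stride):
    """Outputs per side for an unpadded sliding-window pooling layer."""
    return (feature_size - window) // stride + 1


# Pyramid levels from the slides (4x4, 2x2, 1x1) on a 13x13 feature map.
for n in (4, 2, 1):
    window, stride = pooling_config(13, n)
    print(f"{n}x{n} bins: window={window}, stride={stride}, "
          f"outputs per side={output_bins(13, window, stride)}")
```

These window and stride values are the kind of numbers that would go into a framework's pooling-layer configuration for single-size training.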
![Page 13: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/13.jpg)
Details – Training with the Spatial Pyramid Pooling
● Multiple-size training
  - Multiple networks sharing all weights
  - Each network handles a single size (e.g., 224x224, 180x180)
  - Improves scale invariance
resize
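One way to picture multi-size training is as a schedule that switches the input size while the weights stay shared. The sketch below is hypothetical: switching once per epoch and the names `size_schedule` and `descriptor_length` are assumptions of this example, not details stated on the slide. It also shows why the weights can be shared: the length of the SPP output feeding FC6 does not depend on the input size.

```python
from itertools import cycle, islice


def size_schedule(num_epochs, sizes=(224, 180)):
    """Return the input size for each epoch, cycling through `sizes`.

    Assumption: the size changes once per epoch.
    """
    return list(islice(cycle(sizes), num_epochs))


def descriptor_length(channels=256, levels=(4, 2, 1)):
    """Length of the SPP output; independent of the input image size."""
    return channels * sum(n * n for n in levels)


for epoch, size in enumerate(size_schedule(4)):
    print(f"epoch {epoch}: input {size}x{size}, "
          f"FC6 input length {descriptor_length()}")
```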
![Page 14: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/14.jpg)
Details – Fast CNN-based Object Detection
● Features are computed from the entire image only once.
● Similar accuracy, much faster (24x~64x) than R-CNN
R-CNN: 2000 convolutions! SPP-net: 1 convolution!
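The speed-up comes from pooling each candidate window directly on the shared feature map instead of re-running the convolutional layers per window. The sketch below illustrates that idea only. The total stride of 16, the simplified floor/ceil rounding, and the function names are assumptions of this example, not the exact mapping used by SPP-net.

```python
import math


def window_to_feature_map(box, stride=16):
    """Map an image-space box (x0, y0, x1, y1) to feature-map index ranges.

    Assumptions: total network stride of 16 and simplified rounding
    (floor for the start, ceil for the end, at least one cell).
    """
    x0, y0, x1, y1 = box
    fx0, fy0 = x0 // stride, y0 // stride
    fx1 = max(fx0 + 1, math.ceil(x1 / stride))
    fy1 = max(fy0 + 1, math.ceil(y1 / stride))
    return fx0, fy0, fx1, fy1


def crop_region(feature_map, region):
    """Slice one channel (list of rows) to the mapped region."""
    fx0, fy0, fx1, fy1 = region
    return [row[fx0:fx1] for row in feature_map[fy0:fy1]]


# One 8x8 channel computed once for the whole image (dummy values).
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]

# Each candidate window reuses that map; no extra convolution is run.
region = window_to_feature_map((32, 16, 96, 80))
patch = crop_region(feature_map, region)
print(region, len(patch), len(patch[0]))
```

The cropped patch would then go through spatial pyramid pooling to produce a fixed-length descriptor for that window.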
![Page 15: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/15.jpg)
Experiments (1/4)
● ILSVRC image classification task
● 1000 object classes (1,431,167 images)
![Page 16: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/16.jpg)
Experiments (2/4)
● ILSVRC image classification task (rank #3)
● SPP improves all CNN architectures
Top-5 test accuracy
Top-5 val. accuracy
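For reference, top-5 accuracy counts a prediction as correct when the true label is among the five highest-scoring classes. A minimal sketch, with the hypothetical function name `top5_accuracy` and made-up scores:

```python
def top5_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Made-up scores for two samples over six classes; true class is 0 for both.
scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # class 0 ranks last: miss
    [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],  # class 0 ranks first: hit
]
print(top5_accuracy(scores, [0, 0]))
```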
![Page 17: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/17.jpg)
Experiments (3/4)
● ILSVRC image detection task
● Fully annotated 200 object classes across 121,931 images
● Allows evaluation of generic object detection in cluttered scenes at scale
Figure legend: detected region vs. ground truth; detections marked True or False
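Whether a detected region counts as true or false is usually decided by its overlap with the ground-truth box. The sketch below uses intersection over union (IoU). The 0.5 threshold is an assumption of this example and is not stated on the slide, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_true_detection(detected, ground_truth, threshold=0.5):
    """Assumed criterion: a detection is true when IoU >= threshold."""
    return iou(detected, ground_truth) >= threshold


print(is_true_detection((0, 0, 10, 10), (1, 1, 11, 11)))
print(is_true_detection((0, 0, 10, 10), (5, 0, 15, 10)))
```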
![Page 18: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/18.jpg)
Experiments (4/4)
● ILSVRC image detection task (rank #2)
● More practical than R-CNN
![Page 19: 1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition Presenter ByungIn Yoo CS688/WST665](https://reader035.vdocument.in/reader035/viewer/2022062516/56649d6f5503460f94a51a1f/html5/thumbnails/19.jpg)
Conclusion
● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.
● Spatial Pyramid Pooling improves accuracy.
● Multi-size training improves accuracy.
● Full-image representation improves accuracy.
● Classification: SPP improves all CNN architectures evaluated.
● Detection: practical and much faster than R-CNN, with similar accuracy.