1

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Presenter ByungIn Yoo

CS688/WST665

2

Contents

● Introduction

● Motivation

● Previous work

● Main Idea

● Details

● Experiments

● Conclusion

3

Introduction

● Web-scale image retrieval

● Classify images or videos

● Detect and localize objects

● Estimate semantic and geometrical attributes

● Why is this challenging?

● View point

● Illumination

● Occlusion

● Scale

● Deformation

● Cluttered background

4

Motivation

● Current CNNs require a fixed input image size (e.g., 224 x 224).

● Recognition accuracy is degraded!

Figure: the image is cropped or warped to 224x224 before the Convolutional Neural Network (CNN). Cropping causes content loss; warping causes distortion.

5

Motivation

● Current CNNs require a fixed input image size (e.g., 224 x 224).

● Recognition accuracy is degraded!

Figure: crop or warp to 224x224, then the Convolutional Neural Network (CNN), at the cost of content loss or distortion; Spatial Pyramid Pooling is added to the diagram.

6

Previous work (1/2)

● Spatial Pyramid Matching

- very successful in traditional computer vision

Grauman et al., The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.

Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR 2006.

7

Previous work (2/2)

● Zeiler-Fergus Architecture (2013, 1st): 8 layers. Still low accuracy! & Fixed image size

● GoogLeNet (2014, 1st): 22 layers. Too complex model! & Fixed image size

Figure legend: Convolution, Pooling, Softmax, Other

M.D. Zeiler et al., “Visualizing and understanding convolutional neural networks”, arXiv:1311.2901, 2013.

Christian Szegedy et al., “Going Deeper with Convolutions”, arXiv:1409.4842, 2014.

8

Main Idea (1/2)

● Add a Spatial Pyramid Pooling layer!

Figure labels: Previous nets, SPP-net

9

Main Idea (2/2)

● Generates a fixed-length representation regardless of image size/scale.

● Simple (still 8 layers) and Powerful Model!

● Variable input size/scale

- Multi-size training, multi-scale testing, full-image view

● Multi-level pooling

- Robust to deformation

● Operated on feature map

- Pooling in regions

10

Details – Convolutional Layers and Feature Maps

● Inherently, the convolutional layers can accept images of arbitrary size.

● Feature maps encode not only the strength of the responses, but also their spatial positions.
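The first point can be checked with the standard output-size formula for a convolution or pooling layer. The sketch below is only an illustration: the layer parameters in `HYPOTHETICAL_STACK` are assumptions, not the network from these slides, and are there just to show that the feature-map size follows the input size instead of being fixed.

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a conv/pool layer for an n x n input (floor rounding)."""
    return (n + 2 * pad - kernel) // stride + 1

# Hypothetical (kernel, stride, pad) values, chosen for illustration only.
HYPOTHETICAL_STACK = [(7, 2, 0), (3, 2, 0), (5, 2, 0), (3, 2, 0)]

def feature_map_size(n, stack=HYPOTHETICAL_STACK):
    """Push an n x n input through every layer and return the final side length."""
    for kernel, stride, pad in stack:
        n = out_size(n, kernel, stride, pad)
    return n

for side in (180, 224, 320):
    print(side, "->", feature_map_size(side))
```

Nothing in the sliding-window arithmetic depends on a particular input size; only a layer that expects a fixed-length vector does.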

11

Details – The Spatial Pyramid Pooling Layer

● SPP-net is a network with a new Spatial Pyramid Pooling (SPP) layer between Conv5 and FC6

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax

Conv5: 256 filters

SPP output: 256 x (4x4 + 2x2 + 1) = 5376-dimensional vector
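The 5376 figure can be reproduced with a small pure-Python sketch of the pooling step. The bin-edge rule used here (floor for the start, ceil for the end) is an assumption made to keep every bin non-empty, and `spp` and `random_maps` are hypothetical helper names; the slides do not specify an implementation.

```python
import math
import random

def spp(feature_maps, levels=(4, 2, 1)):
    """Max-pool each channel into n x n bins for every pyramid level n.

    feature_maps: list of channels, each a list of rows (h x w numbers).
    Returns one flat list of length channels * sum(n * n for n in levels).
    """
    out = []
    for channel in feature_maps:
        h, w = len(channel), len(channel[0])
        for n in levels:
            for i in range(n):
                # Assumed bin edges: floor for the start, ceil for the end.
                r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                    out.append(max(channel[r][c]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return out

def random_maps(channels, h, w):
    """Random stand-in for Conv5 output: channels x h x w."""
    return [[[random.random() for _ in range(w)] for _ in range(h)]
            for _ in range(channels)]

# Different feature-map sizes, same output length.
for h, w in ((13, 13), (10, 10), (9, 17)):
    print((h, w), "->", len(spp(random_maps(256, h, w))))
```

The output length depends only on the number of channels and the pyramid levels, which is why the fully connected layers can stay unchanged.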

12

Details – Training with the Spatial Pyramid Pooling

● Single-size training

- Simply modify the configuration file of the CNN framework

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → SPP → FC6 → FC7 → SoftMax

Feature map after Conv5: 13x13
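With a fixed 13x13 feature map, each pyramid level reduces to an ordinary sliding-window max pooling. The SPP-net paper sets the window to ceil(a/n) and the stride to floor(a/n) for an a x a map and an n x n level; the sketch below applies that rule, with `pooling_params` as a hypothetical helper name, and also counts the resulting bins per side as a check.

```python
import math

def pooling_params(a, n):
    """Window size, stride, and resulting bins per side for level n on an a x a map."""
    win = math.ceil(a / n)
    stride = a // n
    bins = (a - win) // stride + 1  # number of window positions per side
    return win, stride, bins

for n in (4, 2, 1):
    win, stride, bins = pooling_params(13, n)
    print(f"{n}x{n} level: window {win}, stride {stride}, bins per side {bins}")
```

For the 13x13 map, each level yields exactly n bins per side, so these values could be written straight into a pooling-layer configuration.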

13

Details – Training with the Spatial Pyramid Pooling

● Multiple-size training

- Multiple networks sharing all weights

● Each network handles a single size (e.g., 224x224, 180x180)

● Improves scale invariance

Figure label: resize
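A rough sketch of how two fixed-size networks can share weights: only the pooling windows differ, and the SPP output length stays the same. The 10x10 feature map for 180x180 inputs and the per-epoch switching are taken from the SPP-net paper, not from these slides; the function names are hypothetical.

```python
import math
from itertools import cycle, islice

# Input size -> Conv5 feature-map side. 13 is from the slides;
# 10 is the value the SPP-net paper reports for 180x180 inputs.
FEATURE_MAP_SIDE = {224: 13, 180: 10}
LEVELS = (4, 2, 1)
CHANNELS = 256

def spp_config(input_size):
    """Per-level (window, stride) for the network that handles this input size."""
    a = FEATURE_MAP_SIDE[input_size]
    return [(math.ceil(a / n), a // n) for n in LEVELS]

def output_length(input_size):
    """Length of the SPP output vector for this input size."""
    a = FEATURE_MAP_SIDE[input_size]
    total = 0
    for win, stride in spp_config(input_size):
        per_side = (a - win) // stride + 1
        total += per_side * per_side
    return CHANNELS * total

def epoch_schedule(sizes, epochs):
    """Switch input size every epoch; weights are shared, only pooling changes."""
    return list(islice(cycle(sizes), epochs))

for epoch, size in enumerate(epoch_schedule((224, 180), 4)):
    print(epoch, size, spp_config(size), output_length(size))
```

Because both configurations produce the same vector length, the fully connected layers see identical input shapes whichever size is active.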

14

Details – Fast CNN-based Object Detection

● Features are computed from the entire image only once.

● Similar accuracy to R-CNN, but much faster (24x~64x)

R-CNN: 2000 Convolutions! SPP-net: 1 Convolution!
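The saving comes from projecting each candidate window onto the shared feature map instead of running the network again. The sketch below is a simplified illustration: the uniform stride of 16, the rounding rule, and the names `project_window` and `crop` are assumptions, and the paper's exact mapping also accounts for padding.

```python
import math

def project_window(box, stride=16):
    """Map an image-space box (x0, y0, x1, y1) to a half-open cell range.

    Assumption: the conv stack has a uniform effective stride, and the box
    is shrunk to the feature-map cells that lie fully inside it.
    """
    x0, y0, x1, y1 = box
    fx0, fy0 = math.ceil(x0 / stride), math.ceil(y0 / stride)
    fx1, fy1 = math.floor(x1 / stride), math.floor(y1 / stride)
    # Keep at least one cell so the pooled region is never empty.
    return fx0, fy0, max(fx1, fx0 + 1), max(fy1, fy0 + 1)

def crop(channel, region):
    """Cut the projected region out of one feature-map channel."""
    fx0, fy0, fx1, fy1 = region
    return [row[fx0:fx1] for row in channel[fy0:fy1]]

# One channel of the feature map for a hypothetical 640x480 image.
feature_map = [[0.0] * 40 for _ in range(30)]
windows = [(32, 48, 256, 200), (0, 0, 640, 480)]
for box in windows:
    region = project_window(box)
    patch = crop(feature_map, region)
    print(box, "->", region, "cells:", len(patch), "x", len(patch[0]))
```

Each cropped region has a different size, so it would then pass through spatial pyramid pooling to become a fixed-length vector for the classifier.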

15

Experiments (1/4)

● ILSVRC image classification task

● 1000 object classes (1,431,167 images)

16

Experiments (2/4)

● ILSVRC image classification task (rank #3)

● SPP improves all CNN architectures

Tables (values not reproduced): top-5 test accuracy; top-5 val. accuracy
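For reference, top-5 accuracy counts a prediction as correct when the true class is among the five highest-scoring classes. A minimal sketch follows; the scores and labels are made-up toy data, not ILSVRC results, and `top5_accuracy` is a hypothetical helper name.

```python
def top5_accuracy(scores, labels):
    """Fraction of samples whose true label is among the 5 highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top5 = sorted(range(len(row)), key=lambda k: row[k], reverse=True)[:5]
        hits += label in top5
    return hits / len(labels)

# Toy scores for 7 classes (made up for illustration).
scores = [
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6],  # true class 1 has the highest score
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.1],  # true class 6 has the lowest score
]
print(top5_accuracy(scores, [1, 6]))
```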

17

Experiments (3/4)

● ILSVRC image detection task

● Fully annotated 200 object classes across 121,931 images

● Allows evaluation of generic object detection in cluttered scenes at scale

Figure legend: detected region and ground truth, with detections marked true or false

18

Experiments (4/4)

● ILSVRC image detection task (rank #2)

● More practical than R-CNN

19

Conclusion

● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.

● Spatial Pyramid Pooling improves accuracy.

● Multi-size training improves accuracy.

● Full-image representation improves accuracy.

● Classification: SPP improves all CNNs in the literature.

● Detection: more practical and much faster than R-CNN, with similar accuracy.
