What is the Best Multi-Stage Architecture for Object Recognition?

Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato, and Yann LeCun

Presented by Lingbo Li

ECE, Duke University

Dec. 13, 2010

Outline

• Introduction

• Model Architecture

• Training Protocol

• Experiments: Caltech 101 Dataset, NORB Dataset, MNIST Dataset

• Conclusions

Introduction (I)

Feature extraction stages:

• A filter bank

• A non-linear operation

• A pooling operation

Recognition architectures:

• Single stage of features + supervised classifier: SIFT, HoG, etc.

• Two or more successive stages of feature extractors + supervised classifier: convolutional networks

Introduction (II)

• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?

• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than with a single stage?

• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?

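Model Architecture (I)

• Filter bank layer

Input: a set of feature maps.

Output: a set of feature maps, where the j-th output is the j-th feature map.

Filter: a trainable kernel connecting an input feature map to an output feature map.

Example: a filter bank layer with 64 filters of size 9x9.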
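The slide equations for this layer did not survive, so the following is only a minimal sketch of a filter bank layer. It assumes the form used in the original paper, where each output map is a gain times tanh of the summed convolutions of the input maps with trainable kernels. The names `conv2d_valid`, `filter_bank_layer`, `kernels`, and `gains` are illustrative, not taken from the slides.

```python
import math

def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (kernel flipped) of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_rows, out_cols = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + p][c + q] * kernel[kh - 1 - p][kw - 1 - q]
                 for p in range(kh) for q in range(kw))
             for c in range(out_cols)]
            for r in range(out_rows)]

def filter_bank_layer(inputs, kernels, gains):
    """Assumed form: y_j = gains[j] * tanh(sum_i kernels[i][j] * inputs[i]).

    inputs:  list of 2D feature maps (all the same size)
    kernels: kernels[i][j] is the 2D kernel from input map i to output map j
    gains:   one trainable gain per output map
    """
    outputs = []
    for j, gain in enumerate(gains):
        partial = [conv2d_valid(x, kernels[i][j]) for i, x in enumerate(inputs)]
        rows, cols = len(partial[0]), len(partial[0][0])
        outputs.append([[gain * math.tanh(sum(p[r][c] for p in partial))
                         for c in range(cols)]
                        for r in range(rows)])
    return outputs
```

With a 9x9 kernel, a 143x143 input map would give a 135x135 output map, since a valid convolution shrinks each side by the kernel size minus one.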

Model Architecture (II)

• Subtractive normalization operation

• Divisive normalization operation
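As a rough sketch of how rectification and the two normalization steps fit together: the code below assumes the scheme described in the original paper (subtract a weighted local mean taken over all maps and a spatial window, then divide by the larger of a constant c and the weighted local standard deviation). Zero-padding at the borders and the caller-supplied `window` are simplifying assumptions, and all function names are illustrative.

```python
import math

def abs_rectify(maps):
    """Absolute-value rectification applied element-wise to every map."""
    return [[[abs(v) for v in row] for row in m] for m in maps]

def _local_weighted_sum(maps, w):
    """Weighted sum over all maps and a spatial window, zero-padded at borders."""
    rows, cols = len(maps[0]), len(maps[0][0])
    half_r, half_c = len(w) // 2, len(w[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for col in range(cols):
            acc = 0.0
            for m in maps:
                for p, w_row in enumerate(w):
                    for q, w_pq in enumerate(w_row):
                        rr, cc = r + p - half_r, col + q - half_c
                        if 0 <= rr < rows and 0 <= cc < cols:
                            acc += w_pq * m[rr][cc]
            out[r][col] = acc
    return out

def local_contrast_normalize(maps, window, c=1.0):
    """Subtractive then divisive normalization (window has odd side lengths)."""
    # Scale the weights so they sum to 1 over all maps and window positions.
    total = len(maps) * sum(sum(row) for row in window)
    w = [[v / total for v in row] for row in window]
    # Subtractive step: remove the weighted local mean.
    mean = _local_weighted_sum(maps, w)
    v = [[[x - mu for x, mu in zip(row, mean_row)]
          for row, mean_row in zip(m, mean)] for m in maps]
    # Divisive step: divide by max(c, weighted local standard deviation).
    squared = [[[x * x for x in row] for row in m] for m in v]
    sigma = [[math.sqrt(s) for s in row] for row in _local_weighted_sum(squared, w)]
    return [[[x / max(c, s) for x, s in zip(row, sigma_row)]
             for row, sigma_row in zip(m, sigma)] for m in v]
```

The original paper uses a Gaussian weighting window; a uniform window is enough to try the sketch on small inputs.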

Model Architecture (III)

• An average pooling layer with 4x4 down-sampling

• A max-pooling layer with 4x4 down-sampling
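A minimal sketch of the two pooling variants on a single feature map. It assumes that "4x4 down-sampling" means a stride of 4 in each direction and treats the pooling window size as a separate parameter; the function name `pool2d` is illustrative.

```python
def pool2d(feature_map, window, stride, mode="average"):
    """Pool a 2D list over window x window patches taken every `stride` pixels."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            patch = [feature_map[r + p][c + q]
                     for p in range(window) for q in range(window)]
            if mode == "max":
                pooled_row.append(max(patch))
            else:
                pooled_row.append(sum(patch) / len(patch))
        pooled.append(pooled_row)
    return pooled
```

For example, an 8x8 map pooled with a 4x4 window and stride 4 yields a 2x2 map in either mode.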

Model Architecture (IV)

Combining Modules into a Hierarchy

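Training Protocol (I)

Optimal sparse coding seeks the sparsest code that reconstructs an input from a dictionary.

Under a sparsity condition, this problem can be written as an optimization problem with an L1 penalty on the code.

Given a set of training samples, learning proceeds in two steps:

1) Minimize the loss function with respect to the dictionary.

2) Find the optimal sparse code for each sample by running a rather expensive optimization algorithm.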
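To illustrate why finding the sparse code is expensive, here is a small sketch of one standard iterative solver, ISTA (iterative shrinkage-thresholding). The slides do not say which algorithm was used, so ISTA is a stand-in. The loss form (squared reconstruction error plus an L1 penalty weighted by `lam`) and all names are assumptions for illustration.

```python
def soft_threshold(x, t):
    """Shrink x towards zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparse_coding_loss(y, D, z, lam):
    """||y - D z||^2 + lam * ||z||_1 for one sample (D is a list of rows)."""
    residual = [yi - sum(d * zj for d, zj in zip(row, z)) for yi, row in zip(y, D)]
    return sum(r * r for r in residual) + lam * sum(abs(zj) for zj in z)

def ista(y, D, lam, step, n_iter=200):
    """Minimize the loss above over z with the dictionary D held fixed.

    `step` should be at most 1 / (2 * largest eigenvalue of D^T D).
    """
    n_atoms = len(D[0])
    z = [0.0] * n_atoms
    for _ in range(n_iter):
        residual = [sum(d * zj for d, zj in zip(row, z)) - yi
                    for yi, row in zip(y, D)]
        grad = [2.0 * sum(D[i][j] * residual[i] for i in range(len(D)))
                for j in range(n_atoms)]
        z = [soft_threshold(zj - step * gj, step * lam) for zj, gj in zip(z, grad)]
    return z
```

Every sample needs its own run of this loop, which is the cost that a learned feed-forward predictor is meant to avoid.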

Training Protocol (II)

Predictive Sparse Decomposition (PSD)

PSD trains a regressor to approximate the optimal sparse solution for all training samples.

Learning proceeds by minimizing a loss function that adds a prediction term to the sparse coding loss: the squared distance between the sparse code and the regressor output.

Thus the dictionary and the filters are optimized simultaneously.
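The PSD loss can be sketched as three terms: reconstruction error, an L1 sparsity penalty, and a prediction error that ties the code to the regressor output. The weights `lam` and `alpha`, and the regressor form (a gain times tanh of a linear filter response), are assumptions based on the original paper; the slide formulas were lost in extraction.

```python
import math

def psd_loss(y, z, D, K, gains, lam, alpha):
    """Assumed PSD loss: ||y - D z||^2 + lam * ||z||_1 + alpha * ||z - C(y)||^2.

    D:     dictionary, one row per input dimension
    K:     encoder filters, one row per code dimension
    gains: one gain per code dimension
    C(y)_j = gains[j] * tanh(K[j] . y) is the assumed regressor.
    """
    reconstruction = [sum(d * zj for d, zj in zip(row, z)) for row in D]
    recon_err = sum((yi - ri) ** 2 for yi, ri in zip(y, reconstruction))
    sparsity = lam * sum(abs(zj) for zj in z)
    prediction = [g * math.tanh(sum(k * yi for k, yi in zip(k_row, y)))
                  for g, k_row in zip(gains, K)]
    pred_err = alpha * sum((zj - pj) ** 2 for zj, pj in zip(z, prediction))
    return recon_err + sparsity + pred_err
```

Because both `D` and `K` appear in the same loss, gradient steps on it update the dictionary and the filters together. After training, the regressor alone gives a fast approximate code.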

Training Protocol (III)

A single letter (such as R) denotes an architecture with a single stage of feature extraction followed by a classifier; a double letter (such as RR) denotes an architecture with two stages of feature extraction followed by a classifier.

• R: Filters are set to random values and kept fixed. Classifiers are trained in supervised mode.

• U: Filters are trained using the unsupervised PSD algorithm and kept fixed. Classifiers are trained in supervised mode.

• R+: Filters are initialized with random values. The entire system (feature stages + classifier) is trained in supervised mode by gradient descent.

• U+: Filters are initialized with the PSD unsupervised learning algorithm. The entire system (feature stages + classifier) is trained in supervised mode by gradient descent.
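One way to make the four protocols concrete is to encode each stage as an initialization choice plus a fine-tuning flag. The sketch below assumes the labels R, U, R+ and U+ used in the original paper (one label per stage, so "U+U+" is a two-stage system). `StageTraining` and `parse_protocol` are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class StageTraining:
    init: str         # "random" or "psd" (unsupervised PSD training)
    fine_tuned: bool  # True if the whole system is refined by supervised gradient descent

def parse_protocol(code):
    """Turn a label such as 'R', 'U+' or 'U+U+' into one StageTraining per stage."""
    tokens = re.findall(r"[RU]\+?", code)
    if not tokens or "".join(tokens) != code or len(tokens) > 2:
        raise ValueError(f"unrecognized protocol label: {code!r}")
    return [StageTraining("random" if t[0] == "R" else "psd", t.endswith("+"))
            for t in tokens]
```

For instance, `parse_protocol("RR")` describes two stages of fixed random filters, where only the classifier is trained.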

Experiments (I) – Caltech 101

• Data pre-processing:

1) Convert to gray-scale and resize to 151x151 pixels;

2) Subtract the image mean and divide by the image standard deviation;

3) Apply subtractive/divisive normalization (N layer with c=1);

4) Zero-pad the shorter side to 143 pixels.
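Steps 2 and 4 of the pre-processing can be sketched in a few lines. The gray-scale conversion, resizing and local normalization steps are left out. Centering the image on the zero canvas is an assumption, since the slides only say that the shorter side is zero-padded; the function names are illustrative.

```python
import math

def standardize(image):
    """Subtract the image mean and divide by the image standard deviation."""
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((v - mean) ** 2 for v in pixels) / len(pixels))
    std = std if std > 0 else 1.0  # avoid division by zero on flat images
    return [[(v - mean) / std for v in row] for row in image]

def zero_pad_to(image, size):
    """Place the image at the centre of a size x size canvas of zeros.

    Assumes neither side of the image exceeds `size`.
    """
    rows, cols = len(image), len(image[0])
    top, left = (size - rows) // 2, (size - cols) // 2
    canvas = [[0.0] * size for _ in range(size)]
    for r in range(rows):
        for c in range(cols):
            canvas[top + r][left + c] = image[r][c]
    return canvas
```

In the pipeline above, `zero_pad_to(image, 143)` would be the final call, producing a square 143x143 input.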

• Recognition rates are averaged over 5 random draws of the training set (30 images per class).

• Hyper-parameters are selected to maximize the performance on a validation set of 5 samples per class held out from the training set.


• Using a single stage of feature extraction: 64 26x26 feature maps, classified with multinomial logistic regression or PMK-SVM.

• Using two stages of feature extraction: 64 26x26 feature maps after the first stage and 256 4x4 feature maps after the second stage, classified with multinomial logistic regression or PMK-SVM.
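The feature map sizes quoted here (26x26 after the first stage, 4x4 after the second) can be checked with simple size arithmetic. The filter and pooling sizes below are assumptions taken from the original paper, not from the slides: 9x9 filters in both stages, a 10x10 pooling window with 5x5 down-sampling in stage one, and a 6x6 window with 4x4 down-sampling in stage two.

```python
def conv_out(size, kernel):
    """Side length after a 'valid' convolution."""
    return size - kernel + 1

def pool_out(size, window, stride):
    """Side length after pooling with the given window and stride."""
    return (size - window) // stride + 1

def caltech_feature_sizes(input_size=143):
    """Trace the side length of the feature maps through both stages."""
    stage1 = pool_out(conv_out(input_size, 9), window=10, stride=5)
    stage2 = pool_out(conv_out(stage1, 9), window=6, stride=4)
    return stage1, stage2
```

Under these assumptions a 143x143 input gives 135x135 maps after the first convolution, 26x26 after the first pooling, 18x18 after the second convolution, and 4x4 after the second pooling.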


• Random filters with no filter learning whatsoever can achieve decent performance;

• Supervised fine-tuning improves the performance;

• Two-stage systems are better than their single-stage counterparts;

• With rectification and normalization, unsupervised training does not improve the performance;

• abs rectification is a crucial component for good performance;

• A single-stage system with PMK-SVM reaches the same performance as a two-stage system with logistic regression.

Experiments (II) – NORB Dataset

• NORB dataset has 5 object categories;

• 24300 training samples and 24300 test samples (4860 per class); Each image is gray-scale with 96x96 pixels;

• Only a subset of the training protocols is considered.

1) Random filters do not perform as well as learned filters when more labeled samples are available.

2) The use of abs rectification and normalization makes a big difference.

Experiments (II) – NORB Dataset

Gradient descent is used to find the optimal input patterns for units of the architecture.

In the figure:

• (1-a) random stage-1 filters;

• (1-b) corresponding optimal input patterns;

• (2-a) PSD filters;

• (2-b) corresponding optimal input patterns;

• (3) subset of stage-2 filters after PSD and supervised refinement on Caltech-101.
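The idea of searching for an optimal input pattern can be illustrated on a toy case: a single abs-rectified linear unit with filter `w`, maximized over unit-norm inputs by projected gradient ascent. This is a deliberately simplified stand-in for the full architecture on the slide, which also includes normalization and pooling. The names and the step size are assumptions.

```python
import math

def normalize(x):
    """Rescale x to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def optimal_input(w, x0, lr=0.5, n_iter=100):
    """Maximize |w . x| over unit-norm x by projected gradient ascent."""
    x = normalize(x0)
    for _ in range(n_iter):
        activation = sum(wi * xi for wi, xi in zip(w, x))
        sign = 1.0 if activation >= 0 else -1.0
        # Gradient of |w . x| with respect to x is sign(w . x) * w.
        x = normalize([xi + lr * sign * wi for xi, wi in zip(x, w)])
    return x
```

For this toy unit the optimal input is the filter itself (up to sign and scale). For a full stage with rectification, normalization and pooling the optimal input need not resemble the filter, which is why it is searched for numerically.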

Experiments (III) – MNIST Dataset

• 60,000 gray-scale 28x28 pixel images for training and 10,000 images for testing;

• Two stages of feature extraction followed by a classifier:

Input image: 34x34

The first stage: convolution with 50 7x7 filters (50 28x28 feature maps), then max-pooling over 2x2 windows

The second stage: convolution with 1024 5x5 filters, then max-pooling over 2x2 windows (64 5x5 feature maps)

Classifier: 10-way multinomial logistic regression
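The final classifier takes the 64 feature maps of size 5x5, flattened into 64 x 5 x 5 = 1600 features, and maps them to 10 class probabilities. Below is a minimal sketch of the prediction step of multinomial logistic regression. The weight layout (one row of weights per class) and the function names are assumptions for illustration.

```python
import math

N_FEATURES = 64 * 5 * 5  # 64 feature maps of size 5x5, flattened

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, biases):
    """Return (predicted class index, class probabilities)."""
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    return probs.index(max(probs)), probs
```

In supervised fine-tuning, the gradient of the classification loss is propagated from this classifier back through both feature stages.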

Experiments (III) – MNIST Dataset

• Parameters are trained with PSD: the only hyper-parameter is tuned with a validation set of 10,000 training samples.

• The classifier is randomly initialized;

• The whole system is tuned in supervised mode.

• A test error rate of 0.53% was obtained.

Conclusions (I)

• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?

1) A rectifying non-linearity is the single most important factor.

2) A local normalization layer can also improve the performance.

• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than with a single stage?

1) Two stages are better than one.

2) The performance of a two-stage system is similar to that of the best single-stage systems based on SIFT and PMK-SVM.

Conclusions (II)

• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?

1) Random filters yield good performance only when the training set is small.

2) The optimal input patterns for a randomly initialized stage are similar to the optimal inputs for a stage that uses learned filters.

3) Global supervised learning of the filters yields a good recognition rate if the proper non-linearities are used.

4) Unsupervised pre-training followed by supervised refinement yields the best overall accuracy.
