what is the best multi-stage architecture for object recognition
![Page 1: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/1.jpg)
What is the Best Multi-Stage Architecture for Object Recognition
Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun
Presented by Lingbo Li
ECE, Duke University
Dec. 13th, 2010
![Page 2: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/2.jpg)
Outline
• Introduction
• Model Architecture
• Training Protocol
• Experiments: Caltech 101 Dataset, NORB Dataset, MNIST Dataset
• Conclusions
![Page 3: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/3.jpg)
Introduction (I)
Feature extraction stages:
• A filter bank
• A non-linear operation
• A pooling operation
Recognition architectures:
• Single stage of features + supervised classifier: SIFT, HoG, etc.
• Two or more successive stages of feature extractors + supervised classifier: convolutional networks
![Page 4: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/4.jpg)
Introduction (II)
• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than with a single stage?
• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?
![Page 5: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/5.jpg)
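Model Architecture (I)
• Filter bank layer ($F_{CSG}$):
Input: a 3D array $x$ made of $n_1$ 2D feature maps $x_i$ of size $n_2 \times n_3$
Output: a 3D array $y$ made of $m_1$ feature maps $y_j$ of size $m_2 \times m_3$
Filter: $k_{ij}$, of size $l_1 \times l_2$, connects input feature map $x_i$ to output feature map $y_j$
$y_j = g_j \tanh\left(\sum_i k_{ij} * x_i\right)$, where $y_j$ is the j-th feature map, $*$ is the 2D convolution operator and $g_j$ is a trainable gain
A filter bank layer with 64 filters of size 9x9: $64F_{CSG}^{9 \times 9}$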
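The filter bank computation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the layer computes a gain times the tanh of the summed "valid" 2D convolutions; the function name `filter_bank_layer`, the array layouts, and the small example sizes are assumptions made here, not part of the slides.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def filter_bank_layer(x, k, g):
    """Sketch of a filter bank layer.

    x: (n1, n2, n3) input feature maps
    k: (n1, m1, l1, l2) filters, k[i, j] links input map i to output map j
    g: (m1,) gain coefficients
    Returns y with shape (m1, n2 - l1 + 1, n3 - l2 + 1).
    """
    l1, l2 = k.shape[2:]
    # One l1 x l2 patch per output site: shape (n1, m2, m3, l1, l2).
    windows = sliding_window_view(x, (l1, l2), axis=(1, 2))
    # A true convolution flips the kernel along both spatial axes.
    k_flipped = k[:, :, ::-1, ::-1]
    # Sum over input maps i and patch offsets p, q.
    s = np.einsum("iabpq,ijpq->jab", windows, k_flipped)
    return g[:, None, None] * np.tanh(s)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32, 32))          # 2 input maps of 32x32 (toy size)
k = 0.1 * rng.standard_normal((2, 4, 9, 9))   # 4 output maps, 9x9 filters
g = np.ones(4)
y = filter_bank_layer(x, k, g)
print(y.shape)  # (4, 24, 24)
```

With 9x9 filters each output map loses 8 pixels per side length, which is why a 32x32 input gives 24x24 maps here.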
![Page 6: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/6.jpg)
Model Architecture (II)
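• Rectification layer ($R_{abs}$): $y_{ijk} = |x_{ijk}|$
• Local contrast normalization layer ($N$):
Subtractive normalization operation: $v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}$, where $w_{pq}$ is a Gaussian weighting window normalized so that $\sum_{ipq} w_{pq} = 1$
Divisive normalization operation: $y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})$, where $\sigma_{jk} = \left(\sum_{ipq} w_{pq} \, v_{i,j+p,k+q}^2\right)^{1/2}$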
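A rough NumPy sketch of rectification followed by subtractive and divisive normalization is given below. It assumes a 9x9 Gaussian window whose weights sum to 1 over all feature maps, zero-padded borders, and a default constant `c` equal to the mean local standard deviation; the Gaussian width `sigma=2.0` and all function names are illustrative choices, not values taken from the slides.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_window(size=9, sigma=2.0):
    """2D Gaussian weighting window normalized to sum to 1 (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    w = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()


def weighted_local_sum(x, w):
    """Weighted sum over all maps i and offsets p, q around each site (zero-padded)."""
    r = w.shape[0] // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)))
    windows = sliding_window_view(padded, w.shape, axis=(1, 2))
    # Dividing by the number of maps makes the weights sum to 1 over i, p, q.
    return np.einsum("ijkpq,pq->jk", windows, w) / x.shape[0]


def rectify(x):
    """Absolute value rectification."""
    return np.abs(x)


def local_contrast_normalize(x, w, c=None):
    v = x - weighted_local_sum(x, w)                 # subtractive step
    sigma = np.sqrt(weighted_local_sum(v ** 2, w))   # local standard deviation per site
    if c is None:
        c = sigma.mean()                             # assumed default for c
    return v / np.maximum(c, sigma)                  # divisive step


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 20, 20))
y = local_contrast_normalize(rectify(x), gaussian_window())
print(y.shape)  # (3, 20, 20)
```

The `max(c, sigma)` floor keeps low-contrast regions from being amplified into noise, which is the practical reason for the constant.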
![Page 7: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/7.jpg)
Model Architecture (III)
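• Average pooling and subsampling layer ($P_A$). An average pooling layer with 4x4 down-sampling: $P_A^{4 \times 4}$
• Max-pooling and subsampling layer ($P_M$). A max-pooling layer with 4x4 down-sampling: $P_M^{4 \times 4}$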
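Both pooling layers can be illustrated with one small helper. This is a sketch under the assumption that pooling is applied over square windows and the result is then kept every `step` pixels; the function name `pool` and its parameters are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def pool(x, window, step, mode="average"):
    """Pool each map in x (shape (n, H, W)) over window x window regions, keeping every step-th site."""
    windows = sliding_window_view(x, (window, window), axis=(1, 2))
    windows = windows[:, ::step, ::step]  # spatial down-sampling
    if mode == "average":
        return windows.mean(axis=(-2, -1))
    if mode == "max":
        return windows.max(axis=(-2, -1))
    raise ValueError(f"unknown pooling mode: {mode}")


x = np.arange(16.0).reshape(1, 4, 4)
print(pool(x, window=2, step=2, mode="average")[0])  # [[ 2.5  4.5] [10.5 12.5]]
print(pool(x, window=2, step=2, mode="max")[0])      # [[ 5.  7.] [13. 15.]]
```

Calling `pool(x, window=4, step=4)` would correspond to the 4x4 down-sampling case when the window and the step have the same size.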
![Page 8: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/8.jpg)
Model Architecture (IV)
Combining Modules into a Hierarchy
• $F_{CSG} - P_A$
• $F_{CSG} - R_{abs} - P_A$
• $F_{CSG} - R_{abs} - N - P_A$
• $F_{CSG} - P_M$
![Page 9: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/9.jpg)
Training Protocol (I)
Optimal sparse coding:
Under a sparsity constraint, coding an input can be written as an optimization problem: find the sparsest code that reconstructs the input from a dictionary.
Given training samples, learning proceeds:
1) Minimize the loss function;
2) Find the optimal sparse code by running a rather expensive optimization algorithm.
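To make the "rather expensive optimization algorithm" concrete, here is a minimal sketch that uses ISTA (iterative shrinkage-thresholding), a standard solver for L1-regularized least squares. It assumes the sparse coding loss has the common form $\|Y - DZ\|_2^2 + \lambda \|Z\|_1$; the symbols `y`, `D`, `z`, `lam` and the choice of ISTA are assumptions for illustration, not necessarily what the authors used.

```python
import numpy as np


def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)


def sparse_coding_loss(y, D, z, lam):
    """Assumed loss: squared reconstruction error plus L1 sparsity penalty."""
    return np.sum((y - D @ z) ** 2) + lam * np.sum(np.abs(z))


def sparse_code_ista(y, D, lam, n_iter=200):
    """Approximate the optimal sparse code for y under dictionary D with ISTA."""
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2   # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ z - y)            # gradient of the reconstruction term
        z = soft_threshold(z - grad / lipschitz, lam / lipschitz)
    return z


rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))                 # over-complete toy dictionary
y = rng.standard_normal(16)
z_star = sparse_code_ista(y, D, lam=0.5)
print(sparse_coding_loss(y, D, z_star, 0.5) <= sparse_coding_loss(y, D, np.zeros(32), 0.5))
```

Every input needs its own iterative solve, which is the cost that a learned feed-forward predictor is meant to avoid.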
![Page 10: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/10.jpg)
Training Protocol (II)
Predictive Sparse Decomposition (PSD)
PSD trains a regressor to approximate the sparse solution for all training samples.
Learning proceeds by minimizing a loss function that combines the sparse coding loss with the error between the sparse code and the regressor's prediction.
Thus, the dictionary and the filters are simultaneously optimized.
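A PSD-style loss can be sketched as the sparse coding loss plus a prediction penalty. The form below, $\|Y - DZ\|_2^2 + \lambda \|Z\|_1 + \alpha \|Z - C(Y)\|_2^2$ with a tanh encoder $C$, follows the commonly published PSD formulation; the weight `alpha`, the bias `b`, and the fully connected encoder are assumptions made for this illustration.

```python
import numpy as np


def psd_encoder(y, W, b, g):
    """Feed-forward regressor: gain times tanh of a linear filter response (assumed form)."""
    return g * np.tanh(W @ y + b)


def psd_loss(y, z, D, W, b, g, lam, alpha):
    """Reconstruction error + sparsity penalty + code prediction error."""
    reconstruction = np.sum((y - D @ z) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    prediction = alpha * np.sum((z - psd_encoder(y, W, b, g)) ** 2)
    return reconstruction + sparsity + prediction


rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))     # dictionary (decoder)
W = rng.standard_normal((32, 16))     # encoder filters
b, g = np.zeros(32), np.ones(32)
y = rng.standard_normal(16)
z = psd_encoder(y, W, b, g)           # start from the encoder's own prediction
print(psd_loss(y, z, D, W, b, g, lam=0.5, alpha=1.0))
```

Because the dictionary and the encoder parameters appear in the same loss, a gradient step on it updates both at once, and after training the encoder alone gives a fast approximation of the sparse code.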
![Page 11: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/11.jpg)
Training Protocol (III)
A single letter: an architecture with a single stage of feature extraction followed by a classifier;
A double letter: an architecture with two stages of feature extraction followed by a classifier.
• R and RR: Filters are set to random values and kept fixed. Classifiers are trained in supervised mode.
• U and UU: Filters are trained using the unsupervised PSD algorithm and kept fixed. Classifiers are trained in supervised mode.
• R+ and R+R+: Filters are initialized with random values. The entire system (feature stages + classifiers) is trained in supervised mode by gradient descent.
• U+ and U+U+: Filters are initialized with the PSD unsupervised learning algorithm. The entire system (feature stages + classifiers) is trained in supervised mode by gradient descent.
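The four training protocols differ only in how the filters are initialized and whether they are refined with supervision, so they can be summarized as a small lookup table. The sketch below uses the paper's R, U, R+ and U+ labels as keys; the `Protocol` class and the `parse_protocol` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    filter_init: str          # "random" or "psd"
    fine_tune_filters: bool   # True if the whole system is trained with supervision


PROTOCOLS = {
    "R": Protocol("random", False),
    "U": Protocol("psd", False),
    "R+": Protocol("random", True),
    "U+": Protocol("psd", True),
}


def parse_protocol(label):
    """Map a label such as 'U' or 'R+R+' to (Protocol, number of feature stages)."""
    base = label[:2] if label[1:2] == "+" else label[:1]
    stages = len(label) // len(base) if base else 0
    if base not in PROTOCOLS or label != base * stages:
        raise ValueError(f"unrecognized protocol label: {label!r}")
    return PROTOCOLS[base], stages


print(parse_protocol("U+U+"))
```

In every protocol the classifier itself is trained with supervision; only the treatment of the filters changes.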
![Page 12: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/12.jpg)
Experiments (I) – Caltech 101
• Data pre-processing:
1) Convert to gray-scale and resize to 151x151 pixels;
2) Subtract the image mean and divide by the image standard deviation;
3) Apply subtractive/divisive normalization (N layer with c=1);
4) Zero-pad the shorter side to 143 pixels.
• Recognition rates are averaged over 5 draws of the training set (30 images per class).
• Hyper-parameters are selected to maximize the performance on the validation set of 5 samples per class taken out of the training sets.
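Pre-processing steps 2 and 4 above can be sketched with NumPy alone. This is a rough illustration that assumes the image is already a gray-scale 2D array and that padding is centered; the helper names and the 100x143 example size are hypothetical, and the resizing and local normalization steps are left out.

```python
import numpy as np


def standardize(img):
    """Subtract the image mean and divide by the image standard deviation."""
    return (img - img.mean()) / img.std()


def zero_pad_to(img, size):
    """Center img on a canvas of zeros so that each side is at least `size` pixels."""
    h, w = img.shape
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    return np.pad(img, ((pad_h // 2, pad_h - pad_h // 2),
                        (pad_w // 2, pad_w - pad_w // 2)))


rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(100, 143))   # stand-in for a resized gray-scale image
standardized = standardize(img)
out = zero_pad_to(standardized, 143)
print(out.shape)  # (143, 143)
```

Padding after standardization means the padded border sits at the image mean (zero), so it adds no artificial contrast edge in the average sense.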
![Page 13: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/13.jpg)
Experiments (I) – Caltech 101
• Using a Single Stage of Feature Extraction: 64 26x26 feature maps → multinomial logistic regression or PMK-SVM
• Using Two Stages of Feature Extraction: 64 26x26 feature maps → 256 4x4 feature maps → multinomial logistic regression or PMK-SVM
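The 26x26 and 4x4 feature map sizes can be checked with simple arithmetic. The sketch below assumes 9x9 filters in both stages, a 143x143 input, 10x10 pooling with a step of 5 in the first stage, and 6x6 pooling with a step of 4 in the second; the pooling sizes are assumptions taken from the paper's setup rather than from the slide text.

```python
def conv_out(n, k):
    """Side length after a 'valid' convolution with a k x k filter."""
    return n - k + 1


def pool_out(n, window, step):
    """Side length after pooling over window x window regions every `step` pixels."""
    return (n - window) // step + 1


# Stage 1: 143x143 input, 9x9 filters, assumed 10x10 pooling with step 5.
stage1 = pool_out(conv_out(143, 9), window=10, step=5)
# Stage 2: 9x9 filters, assumed 6x6 pooling with step 4.
stage2 = pool_out(conv_out(stage1, 9), window=6, step=4)
print(stage1, stage2)  # 26 4
```

Note that the first-stage pooling windows overlap (window 10, step 5), so neighboring outputs share half of their inputs.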
![Page 14: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/14.jpg)
Experiments (I) – Caltech 101
![Page 15: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/15.jpg)
Experiments (I) – Caltech 101
• Random filters and no filter learning whatsoever can achieve decent performance with the proper non-linearities;
• Supervised fine-tuning improves the performance;
• Two-stage systems are better than their single-stage counterparts;
• With rectification and normalization, unsupervised training does not significantly improve the performance;
• abs rectification is a crucial component for good performance;
• A single-stage system with PMK-SVM reaches the same performance as a two-stage system with logistic regression.
![Page 16: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/16.jpg)
Experiments (II) – NORB Dataset
• NORB dataset has 5 object categories;
• 24300 training samples and 24300 test samples (4860 per class); Each image is gray-scale with 96x96 pixels;
• Only consider the protocols;
1) Random filters do not perform as well as learned filters when more labeled samples are available.
2) The use of abs and normalization makes a big difference.
![Page 17: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/17.jpg)
Experiments (II) – NORB Dataset
Use gradient descent to find the optimal input patterns in a $F_{CSG} - R_{abs} - N - P_A$ architecture.
In the left figure:
•(1-a) random stage-1 filters;
•(1-b) corresponding optimal inputs;
•(2-a) PSD filters;
•(2-b) Optimal input patterns;
•(3) subset of stage-2 filters after PSD and supervised refinement on Caltech-101.
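The idea of finding an optimal input by gradient steps can be shown on a toy unit. The sketch below is not the full feature stage: it assumes a single unit whose response is the rectified filter output $|w \cdot x|$ and maximizes it over unit-norm inputs by projected gradient ascent (equivalently, gradient descent on the negative response). All names and step sizes are illustrative.

```python
import numpy as np


def optimal_input(w, steps=100, lr=0.1, seed=0):
    """Projected gradient ascent on |w . x| subject to ||x|| = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(w.shape)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        grad = np.sign(w @ x) * w    # gradient of |w . x| with respect to x
        x = x + lr * grad            # ascent step on the response
        x /= np.linalg.norm(x)       # project back onto the unit sphere
    return x


w = np.random.default_rng(1).standard_normal(81)   # a flattened 9x9 random filter
x_opt = optimal_input(w)
cosine = abs(w @ x_opt) / np.linalg.norm(w)
print(cosine > 0.99)
```

For this toy unit the optimal input is simply the filter itself up to sign; for a complete stage with normalization and pooling the optimal patterns differ from the raw filters, which is what the figure illustrates.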
![Page 18: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/18.jpg)
Experiments (III) – MNIST Dataset
• 60,000 gray-scale 28x28 pixel images for training and 10,000 images for testing;
• Two stages of feature extraction:
Input image: 34x34
The first stage: convolution with 50 7x7 filters → 50 28x28 feature maps; max-pooling with 2x2 windows → 50 14x14 feature maps
The second stage: convolution with 1024 5x5 filters → 64 10x10 feature maps; max-pooling with 2x2 windows → 64 5x5 feature maps
Classifier: 10-way multinomial logistic regression
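The feature map sizes of the MNIST network follow from the filter and pooling sizes listed above. A short sketch that traces the side length through both stages, assuming "valid" convolutions and non-overlapping 2x2 max-pooling (the `trace_sizes` helper is a hypothetical name):

```python
def trace_sizes(size, layers):
    """Return the map side length after each (name, window, step) layer."""
    sizes = [size]
    for _name, window, step in layers:
        size = (size - window) // step + 1
        sizes.append(size)
    return sizes


layers = [
    ("conv 7x7", 7, 1), ("max-pool 2x2", 2, 2),   # the first stage
    ("conv 5x5", 5, 1), ("max-pool 2x2", 2, 2),   # the second stage
]
sizes = trace_sizes(34, layers)
print(sizes)                  # [34, 28, 14, 10, 5]

n_features = 64 * sizes[-1] ** 2
print(n_features)             # 1600
```

The classifier therefore receives 64 maps of 5x5, that is 1600 features per image.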
![Page 19: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/19.jpg)
Experiments (III) – MNIST Dataset
• Parameters are trained with PSD: the only hyper-parameter is tuned with a validation set of 10,000 training samples.
• The classifier is randomly initialized;
• The whole system is tuned in supervised mode.
• A test error rate of 0.53% was obtained.
![Page 20: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/20.jpg)
Conclusions (I)
• Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
1) A rectifying non-linearity is the single most important factor.
2) A local normalization layer can also improve the performance.
• Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than with a single stage?
1) Two stages are better than one.
2) The performance of a two-stage system is similar to that of the best single-stage systems based on SIFT and PMK-SVM.
![Page 21: What is the Best Multi-Stage Architecture for Object Recognition](https://reader035.vdocument.in/reader035/viewer/2022062304/56812d88550346895d929bce/html5/thumbnails/21.jpg)
Conclusions (II)
• Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?
1) Random filters yield good performance only in the case of a small training set.
2) The optimal input patterns for a randomly initialized stage are similar to the optimal inputs for a stage that uses learned filters.
3) The global supervised learning of filters yields a good recognition rate when combined with the proper non-linearities.
4) Unsupervised pre-training followed by supervised refinement yields the best overall accuracy.