Thesis title: “Studies in Pattern Classification – Biological Modeling, Uncertainty Reasoning, and Statistical Learning”

3 parts:
(1) Handwritten Digit Recognition with a Vision-Based Model (part in CVPR-2000)
(2) An Uncertainty Framework for Classification (UAI-2000)
(3) Selection of Support Vector Kernel Parameters (ICML-2000)



Handwritten Digit Recognition with a Vision-Based Model

Loo-Nin Teow & Kia-Fock Loe

School of Computing

National University of Singapore

OBJECTIVE

• To develop a vision-based system that extracts features for handwritten digit recognition based on the following principles:
– Biological basis;
– Linear separability;
– Clear semantics.

Developing the model

2 main modules:

• Feature extractor – generates a feature vector from the raw pixel map.

• Trainable classifier – outputs the class based on the feature vector.

General System Structure

[Diagram: Handwritten Digit Recognizer — Raw Pixel Map → Feature Extractor → Feature Vector → Feature Classifier → Digit Class]

The Biological Visual System

[Diagram: the visual pathway — Eye → Optic nerve → Optic chiasm → Optic tract → Lateral geniculate nucleus → Optic radiation → Primary Visual Cortex]

Receptive Fields

[Figure: a visual cell receives input from a receptive field — a local region of the visual map — and produces output activations]

Simple Cell Receptive Fields

Simple Cell Responses

[Figure: stimulus cases with and without simple-cell activation]

Hypercomplex Receptive Fields

Hypercomplex Cell Responses

[Figure: stimulus cases with and without hypercomplex-cell activation]

Biological Vision

• Local spatial features;

• Edge and corner orientations;

• Dual-channel (bright/dark; on/off);

• Non-hierarchical feature extraction.

The Feature Extraction Process

[Diagram: Selective Convolution followed by Feature Aggregation — input images I (2 of 36x36) → feature maps Q (32 of 32x32) → aggregated maps F (32 of 9x9)]

Dual Channel

• On-Channel: intensity-normalize the image:

I_on(X, Y) = I(X, Y) / MaxGrayValue

• Off-Channel: complement of the on-channel:

I_off(X, Y) = 1 − I_on(X, Y)
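The dual-channel construction can be sketched in a few lines of NumPy (function and variable names are mine, not from the slides):

```python
import numpy as np

def dual_channel(image, max_gray=255.0):
    """Split a grayscale image into on/off channels.

    On-channel:  intensity-normalized image, I_on = I / MaxGrayValue.
    Off-channel: complement of the on-channel, I_off = 1 - I_on.
    """
    on = image.astype(float) / max_gray
    off = 1.0 - on
    return on, off

# Example: a dark stroke (0) on a bright background (255).
img = np.array([[255, 0], [255, 255]])
on, off = dual_channel(img)
```

The off-channel makes dark ink respond as strongly as bright regions do in the on-channel, so the same mask templates can serve both polarities.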

Selective Convolution

• Local receptive fields – same spatial features at different locations.

• Truncated linear halfwave rectification – strength of a feature's presence.

• “Soft” selection based on the central pixel – reduces false edges and corners.

Selective Convolution (formulae)

Q_j(X, Y) = I(X, Y) · G_j(X, Y)

where

G_j(X, Y) = Φ( Σ_{M=−r}^{r} Σ_{N=−r}^{r} H_j(M, N) · I(X+M, Y+N) )

and Φ is the halfwave rectification:

Φ(z) = z if z ≥ 0; 0 otherwise.
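A direct (unoptimized) sketch of selective convolution under these formulae; masks of any odd size work, and 5x5 masks would match the 36x36 → 32x32 map sizes (an assumption on my part):

```python
import numpy as np

def halfwave(z):
    """Halfwave rectification: phi(z) = z if z >= 0, else 0."""
    return np.maximum(z, 0.0)

def selective_convolution(I, H):
    """Selective convolution of channel I with mask template H (sketch).

    G_j(X, Y) = phi( sum_{M,N} H(M, N) * I(X+M, Y+N) )  -- rectified convolution
    Q_j(X, Y) = I(X, Y) * G_j(X, Y)                     -- "soft" selection by
                                                        --  the central pixel
    """
    r = H.shape[0] // 2
    h, w = I.shape
    Q = np.zeros((h - 2 * r, w - 2 * r))
    for x in range(r, h - r):
        for y in range(r, w - r):
            window = I[x - r:x + r + 1, y - r:y + r + 1]
            g = halfwave(np.sum(H * window))
            Q[x - r, y - r] = I[x, y] * g
    return Q
```

Multiplying the rectified response by the central pixel I(X, Y) suppresses responses where there is no ink, which is the “soft” selection that reduces false edges and corners.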

Convolution Mask Templates

• Simplified models of the simple and hypercomplex receptive fields.

• Detect edges and end-stops of various orientations.

• Corners – more robust than edges:
– On-channel end-stops: convex corners;
– Off-channel end-stops: concave corners.

Some representatives of the 16 mask templates used in the feature extraction

[Figure: mask weight grids with entries such as −8, −2, −1, 1, and 2]

Feature Aggregation

• Similar to subsampling:
– reduces number of features;
– reduces dependency on features' positions;
– local invariance to distortions and translations.

• Different from subsampling:
– magnitude-weighted averaging;
– detects presence of feature in window;
– large window overlap.

Feature Aggregation (formulae)

Magnitude-Weighted Average:

F_j(X, Y) = Σ_{M=0}^{m−1} Σ_{N=0}^{n−1} W_{jXY}(M, N) · Q_j(S·X + M, T·Y + N)

where the weights are the normalized feature magnitudes (S and T are the window strides; m × n is the window size):

W_{jXY}(M, N) = Q_j(S·X + M, T·Y + N) / Σ_{u=0}^{m−1} Σ_{v=0}^{n−1} Q_j(S·X + u, T·Y + v)
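A sketch of the magnitude-weighted aggregation; the window size of 8 and stride of 3 are assumptions chosen to reproduce the 32x32 → 9x9 map sizes with large window overlap:

```python
import numpy as np

def magnitude_weighted_pool(Q, win=8, stride=3, eps=1e-12):
    """Magnitude-weighted average over overlapping windows (sketch).

    Unlike plain subsampling, each window's output is
        F = sum(W * Q_win)   with   W = Q_win / sum(Q_win),
    i.e. sum(Q_win**2) / sum(Q_win): large responses dominate, so the
    pooled value reflects the presence of a feature in the window.
    """
    h, w = Q.shape
    out_h = (h - win) // stride + 1
    out_w = (w - win) // stride + 1
    F = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            window = Q[x * stride:x * stride + win,
                       y * stride:y * stride + win]
            F[x, y] = np.sum(window ** 2) / (np.sum(window) + eps)
    return F
```

With these parameters, (32 − 8) / 3 + 1 = 9, so each 32x32 feature map pools down to 9x9 while adjacent windows overlap by 5 pixels.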

Classification

• Linear discrimination systems:
– Single-layer perceptron network – minimize the cross-entropy cost function;
– Linear support vector machines – maximize the interclass margin width.

• k-nearest neighbor:
– Euclidean distance;
– Cosine similarity.

[Figure: two classes of points, x and o, separated by a linear boundary]

Multiclass Classification Schemes for linear discrimination systems

• One-per-class (1 vs 9)

• Pairwise (1 vs 1)

• Triowise (1 vs 2)

Experiments

• MNIST database of handwritten digits.

• 60000 training, 10000 testing.

• 36x36 input image.

• 32 9x9 feature maps.

Preliminary Experiments

Classifier           Scheme              Voting   Train Error (%)    Test Error (%)
                                         Option   (60000 samples)    (10000 samples)
Perceptron Network   1-per-class         –        0.00               2.14
                     Pairwise            Hard     0.00               0.88
                     Pairwise            Soft     0.00               0.87
                     Triowise            Hard     0.00               0.72
                     Triowise            Soft     0.00               0.72
Linear SVMs          Pairwise            Hard     0.00               0.98
                     Pairwise            Soft     0.00               0.82
                     Triowise            Hard     0.00               0.74
                     Triowise            Soft     0.00               0.72
k-Nearest Neighbor   Euclidean distance  –        0.00               1.39 (k = 3)
                     Cosine similarity   –        0.00               1.09 (k = 3)

Experiments on Deslanted Images

Classifier           Scheme     Voting   Train Error (%)    Test Error (%)
                                Option   (60000 samples)    (10000 samples)
Perceptron Network   Pairwise   Hard     0.00               0.81
                     Pairwise   Soft     0.00               0.73
                     Triowise   Hard     0.00               0.63
                     Triowise   Soft     0.00               0.62
Linear SVMs          Pairwise   Hard     0.00               0.69
                     Pairwise   Soft     0.00               0.68
                     Triowise   Hard     0.00               0.65
                     Triowise   Soft     0.00               * 0.59 *

Misclassified Characters

Comparison with Other Models

Classifier Model Test Error (%)

LeNet-4 1.10

LeNet-4, boosted [distort] 0.70

LeNet-5 0.95

LeNet-5 [distort] 0.80

Tangent distance 1.10

Virtual SVM 0.80

Our model [deslant] * 0.59 *

Conclusion

• Our model extracts features that are:
– biologically plausible;
– linearly separable;
– semantically clear.

• Needs only a linear classifier:
– relatively simple structure;
– trains fast;
– gives excellent classification performance.

Hierarchy of Features?

• The idea originated with Hubel & Wiesel:
– LGN → simple → complex → hypercomplex;
– later studies show these pathways to be parallel.

• A hierarchy leads to too many feature combinations.

• It is simpler to have only one convolution layer.

Linear Discrimination

Output:

p = g(f(x))

where f defines a hyperplane:

f(x) = w · x + b

and g is the activation function:

g(z) = 1 / (1 + exp(−z))

or

g(z) = −1 if z ≤ −1;  z if −1 < z < 1;  1 if z ≥ 1.
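Both activation choices can be written down directly; a minimal sketch with illustrative values:

```python
import numpy as np

def f(x, w, b):
    """Hyperplane: f(x) = w . x + b."""
    return np.dot(w, x) + b

def g_logistic(z):
    """Sigmoid activation, as used with the cross-entropy cost."""
    return 1.0 / (1.0 + np.exp(-z))

def g_piecewise(z):
    """Clipped linear activation: -1 below -1, z in between, 1 above 1."""
    return np.clip(z, -1.0, 1.0)

# Output p = g(f(x)) for an illustrative feature vector and weights:
x = np.array([0.2, -0.4])
w = np.array([1.0, 1.0])
p = g_logistic(f(x, w, b=0.0))
```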

One-per-class Classification

• The unit with the largest output value indicates the class of the character:

A* = argmax_A p_A

Pairwise Classification

Soft Voting:

A* = argmax_A Σ_{B≠A} p_{AB}

Hard Voting:

A* = argmax_A Σ_{B≠A} σ(p_{AB})

where

σ(z) = 1 if z ≥ θ; −1 otherwise  (θ is the classifier's decision threshold).
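A small sketch of pairwise voting over a dictionary of classifier outputs; the data layout and the threshold theta are assumptions:

```python
def pairwise_vote(p, classes, theta=0.5, hard=False):
    """Pairwise voting (sketch). p[(A, B)] is the A-vs-B classifier's
    output for class A. Soft voting sums the raw outputs; hard voting
    sums thresholded votes sigma(z) = +1 if z >= theta else -1."""
    def sigma(z):
        return 1.0 if z >= theta else -1.0

    def score(A):
        return sum((sigma(p[(A, B)]) if hard else p[(A, B)])
                   for B in classes if B != A)

    return max(classes, key=score)

# Three classes; class 1 wins both of its pairwise duels.
p = {(0, 1): 0.2, (1, 0): 0.8,
     (0, 2): 0.6, (2, 0): 0.4,
     (1, 2): 0.9, (2, 1): 0.1}
winner = pairwise_vote(p, classes=[0, 1, 2])
```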

Triowise Classification

Soft Voting:

A* = argmax_A Σ_{B≠A} Σ_{C≠A,B; B<C} p_{ABC}

Hard Voting:

A* = argmax_A Σ_{B≠A} Σ_{C≠A,B; B<C} σ(p_{ABC})

where p_{ABC} is the trio classifier's output for class A against classes B and C.
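A sketch of triowise voting, assuming p[(A, B, C)] with B < C is the trio classifier's output for class A against classes B and C (this indexing convention and the threshold theta are my assumptions):

```python
from itertools import combinations

def triowise_vote(p, classes, theta=0.5, hard=False):
    """Triowise (1-vs-2) voting sketch: each class accumulates the
    outputs of every trio classifier in which it is the '1' side."""
    def sigma(z):
        return 1.0 if z >= theta else -1.0

    def score(A):
        rest = [c for c in classes if c != A]
        return sum((sigma(p[(A, B, C)]) if hard else p[(A, B, C)])
                   for B, C in combinations(rest, 2))

    return max(classes, key=score)
```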

k-Nearest Neighbor

Euclidean Distance:

d(x_test, x_train) = ‖x_test − x_train‖

Cosine Similarity:

s(x_test, x_train) = (x_test · x_train) / (‖x_test‖ ‖x_train‖)

where ‖z‖ = √(z · z).
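Both measures and the resulting k-NN rule, as a sketch (majority voting among the k nearest; tie-breaking is not specified in the slides):

```python
import numpy as np

def euclidean(a, b):
    """||a - b||"""
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    """(a . b) / (||a|| ||b||); assumes neither vector is all-zero."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(x, train_x, train_y, k=3, use_cosine=False):
    """Classify x by majority label among its k nearest training vectors."""
    if use_cosine:
        scores = [-cosine_sim(x, t) for t in train_x]  # higher sim = nearer
    else:
        scores = [euclidean(x, t) for t in train_x]
    order = np.argsort(scores)[:k]
    labels = [train_y[i] for i in order]
    return max(set(labels), key=labels.count)
```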

Confusion Matrix (triowise SVMs / soft voting / deslanted)

True \ Predicted: 0 1 2 3 4 5 6 7 8 9 # errors

0 977 0 0 0 0 0 2 1 0 0 3

1 0 1134 1 0 0 0 0 0 0 0 1

2 1 0 1023 1 1 0 0 5 1 0 9

3 0 0 1 1005 0 4 0 0 0 0 5

4 0 0 0 0 975 0 1 1 1 4 7

5 1 0 0 3 0 887 1 0 0 0 5

6 5 2 1 0 1 1 948 0 0 0 10

7 0 1 2 1 0 0 0 1022 0 2 6

8 0 0 1 0 0 1 0 0 972 0 2

9 0 0 0 0 7 2 0 1 1 998 11

Number of iterations to convergence for the perceptron network

Scheme        # units   # epochs

1-per-class 10 281

Pairwise 90 57

Triowise 360 147
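The unit counts in the table follow from the scheme combinatorics; a quick check, assuming 10 digit classes, one unit per ordered pair for the pairwise scheme, and one unit per (class, unordered pair) for the triowise scheme:

```python
from math import comb

n = 10                          # digit classes
one_per_class = n               # one unit per class
pairwise = n * (n - 1)          # ordered pairs (A, B), A != B
triowise = n * comb(n - 1, 2)   # class A vs an unordered pair {B, C}
```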