

Image Classification with Neurally Motivated Higher-Order Models

James Bergstra, Yoshua Bengio, Jerome Louradour

April 2008



Summary

Neuroscience → Activation Function

(Hubel & Wiesel, 1962) → logistic sigmoid(b + ∑_i W_i x_i)

(Rust et al., 2005) → this work

Models inspired by (Rust et al., 2005):

learn salient pairwise features of the input;

perform well on image classification tasks;

are computationally cheap;

motivate more accurate neural modelling.

Q. Why do they work? When? A. We don't know yet. Conjectures, future work.


Logistic Sigmoid

response = sigm(w · x + b) = 1 / (1 + exp(−w · x − b))

approximation of V1 simple cell firing rate

computationally cheap

differentiable → learnable

two-layer nets (possibly fat) are universal approximators

treats each x_i independently → no XOR without hidden layers
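For concreteness, a minimal numpy sketch of this first-order unit; the weights here are random stand-ins for learned values, and the input size is illustrative:

```python
import numpy as np

def sigm(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit(x, w, b):
    """First-order unit: response = sigm(w . x + b)."""
    return sigm(np.dot(w, x) + b)

rng = np.random.default_rng(0)
x = rng.normal(size=5)   # hypothetical 5-dimensional input
w = rng.normal(size=5)   # weights (random stand-ins, not learned)
print(sigmoid_unit(x, w, b=0.1))
```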


(Rust et al., 2005) V1 model

Rust, N., Schwartz, O., Movshon, J. A., & Simoncelli, E. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46, pp. 945–956.

response(E, S) = α + (βE^ζ − δS^ζ) / (1 + γE^ζ + εS^ζ)

E = √(max(0, w′x)² + x′V′Vx),   S = √(x′U′Ux)

separate positive scalar Excitation and Shunting inhibition

better approximation of V1 simple and complex cell

system-level account, not neuron model

low-rank V, U → computationally [almost as] cheap

can capture non-linear dependencies among a neuron's inputs
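A sketch of this response in numpy, keeping the symbols above as parameter names (the default values of 1 match the simplification used later; dimensions and inputs are illustrative). Note how the low-rank quadratic term x′V′Vx = ||Vx||² costs only O(K·D):

```python
import numpy as np

def rust_response(x, w, V, U, alpha=0.0, beta=1.0, delta=1.0,
                  gamma=1.0, eps=1.0, zeta=1.0):
    """Rust et al. (2005) V1 response with excitation E and shunting S.

    V and U are K x D with K << D, so x'V'Vx = ||Vx||^2 and
    x'U'Ux = ||Ux||^2 cost O(K*D) rather than O(D^2)."""
    E = np.sqrt(max(0.0, w @ x) ** 2 + np.sum((V @ x) ** 2))  # excitation
    S = np.sqrt(np.sum((U @ x) ** 2))                         # shunting inhibition
    num = beta * E ** zeta - delta * S ** zeta
    den = 1.0 + gamma * E ** zeta + eps * S ** zeta
    return alpha + num / den

rng = np.random.default_rng(0)
D, K = 64, 4                                  # illustrative input size and rank
x, w = rng.normal(size=D), rng.normal(size=D)
V, U = rng.normal(size=(K, D)), rng.normal(size=(K, D))
print(rust_response(x, w, V, U))
```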


Minsky and Papert’s HPU

hpu(x) = act(b + ∑_i Φ_i(x_i) + ∑_ij Φ_ij(x_i, x_j) + ∑_ijk Φ_ijk(x_i, x_j, x_k) + ···)

order > 1 → capable of XOR (see the sketch below)

computationally expensive: O(input_size^order)

many parameters to learn at higher orders
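The XOR claim is easy to verify: on Boolean inputs, x1 + x2 − 2·x1·x2 equals XOR, so one pairwise term suffices. A minimal sketch with a hand-picked (not trained) second-order unit:

```python
import numpy as np

def hpu2(x, b, w, W):
    """Second-order HPU: threshold(b + sum_i w_i x_i + sum_ij W_ij x_i x_j)."""
    z = b + w @ x + x @ W @ x
    return int(z > 0.5)

# XOR = x1 + x2 - 2*x1*x2 on Boolean inputs: one pairwise weight suffices.
w = np.array([1.0, 1.0])
W = np.array([[0.0, -2.0],
              [0.0,  0.0]])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), hpu2(np.array([x1, x2], float), 0.0, w, W))
# prints 0, 1, 1, 0 -- XOR, impossible for any first-order unit
```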


Our quadratic models

h_shunt,i(x) = (E_i(x) − S_i(x)) / (1 + E_i(x) + S_i(x))

h_ratio,i(x) = E_i(x) / (1 + E_i(x))

h_quad,i(x) = sigmoid(b_i + w_i′x + x′V_i′V_i x − x′U_i′U_i x)

E_i(x) = √(x′V_i′V_i x + log[1 + exp(w_i · x)]²),   S_i(x) = √(x′U_i′U_i x)

log(1 + exp(z)): differentiable version of max(0, z)

Rust model exponents ζ fixed to 1 for learning

Rust model coefficients β, δ, γ, ε fixed to 1
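Putting these definitions together, a minimal numpy sketch of the three unit types for a single unit i, with V_i and U_i low-rank as in the Rust model (all parameter values here are random placeholders):

```python
import numpy as np

def softplus(z):
    """log(1 + exp(z)): differentiable version of max(0, z)."""
    return np.log1p(np.exp(z))

def E(x, w, V):
    return np.sqrt(np.sum((V @ x) ** 2) + softplus(w @ x) ** 2)

def S(x, U):
    return np.sqrt(np.sum((U @ x) ** 2))

def h_shunt(x, w, V, U):
    e, s = E(x, w, V), S(x, U)
    return (e - s) / (1.0 + e + s)

def h_ratio(x, w, V):
    e = E(x, w, V)
    return e / (1.0 + e)

def h_quad(x, b, w, V, U):
    z = b + w @ x + np.sum((V @ x) ** 2) - np.sum((U @ x) ** 2)
    return 1.0 / (1.0 + np.exp(-z))       # plain sigmoid squashing

rng = np.random.default_rng(0)
D, K = 64, 8                              # illustrative sizes
x, b, w = rng.normal(size=D), 0.0, rng.normal(size=D)
V, U = rng.normal(size=(K, D)), rng.normal(size=(K, D))
print(h_shunt(x, w, V, U), h_ratio(x, w, V), h_quad(x, b, w, V, U))
```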


Different Squashing Functions

Figure: sigmoid(0.7x) and ratio(x), on a linear scale (left) and a log scale (right).

sigmoid(x) = 1/(1 + e−x)

ratio(x) = log(1 + ex)/(1 + log(1 + ex))
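Evaluating both functions at a few points makes the difference in saturation speed concrete (a quick sketch; the 0.7 scaling matches the plot above, and the sample points are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ratio(x):
    sp = np.log1p(np.exp(x))              # softplus
    return sp / (1.0 + sp)

for x in [0.0, 2.0, 5.0, 10.0, 20.0]:
    print(f"x = {x:5.1f}   sigmoid(0.7x) = {sigmoid(0.7 * x):.6f}   "
          f"ratio(x) = {ratio(x):.6f}")
# sigmoid is within ~1e-6 of 1 by x = 20; ratio is still only ~0.95
```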


Classifying Features

Linear output classifier:

f(x) = b_y + W_y h_model(x)

Softmax for predicted class distribution:

p(class i | x) = exp(f_i(x)) / ∑_j exp(f_j(x))

Negative log likelihood training criterion:

Loss(x, target) = −log p(target | x)

Learn W_y, b_y, and the parameters of h_model by gradient descent with an early-stopping heuristic.
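A compact sketch of this output layer for one example. The gradient dLoss/df_j = p_j − 1{j = target} is the standard softmax + NLL gradient; the sizes, learning rate, and random features standing in for h_model(x) are illustrative assumptions:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())               # subtract max for stability
    return e / e.sum()

def nll_and_grads(h, target, Wy, by):
    """NLL of the target class and gradients for the output layer."""
    p = softmax(by + Wy @ h)
    loss = -np.log(p[target])
    df = p.copy()
    df[target] -= 1.0                     # dLoss/df_j = p_j - 1{j == target}
    return loss, np.outer(df, h), df      # dLoss/dWy, dLoss/dby

rng = np.random.default_rng(0)
n_classes, n_hidden, lr = 3, 8, 0.1
Wy = 0.01 * rng.normal(size=(n_classes, n_hidden))
by = np.zeros(n_classes)
h = rng.normal(size=n_hidden)             # stands in for h_model(x)
loss, dWy, dby = nll_and_grads(h, target=1, Wy=Wy, by=by)
Wy -= lr * dWy                            # one gradient-descent step
by -= lr * dby
```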


Results: MNIST

Error Rates (%)

Family  Units  K    Valid  Test
sigm1   -      -    1.8    1.9
sigm3   -      -    2.1    2.4
SVM     -      -    -      1.4
quad    80     ×8   1.8    1.9
ratio   320    ×8   1.2    1.4
shunt   40     ×8   1.4    1.5

Table: MNIST results for the best model in each family. Units is the hidden layer width (number of neurons). K, the number of components of higher-order interactions, was selected on the validation set. Random guessing would score 90% error.


Datasets ShapeSet1, ShapeSet2

Figure: Sample inputs from ShapeSet1 (top) and ShapeSet2 (bottom).

Ten thousand images for training (5K validation, 5K test).

Motivated by Baby-AI project.

Why not scale and centre? Doesn’t scale to complex scenes!


Results: ShapeSet1

Error Rates (%)

Family  Units   K    Valid  Test
SVM     -       -    29.6   29.3 ± 1.0
sigm    200     -    13.5   14.0 ± 1.0
sigm2   200 ×2  -    6.8    7.2 ± 0.8
sigm3   200 ×3  -    5.4    5.9 ± 0.7
quad    20      ×4   6.3    7.4 ± 0.8
ratio   40      ×8   2.1    2.4 ± 0.4
shunt   40      ×16  2.9    3.2 ± 0.5

Table: Performance on ShapeSet1 for the best model in each family. Units is the hidden layer width (number of neurons). The best models used K components of higher-order interactions. Random guessing would score 66.7% error. A 95% confidence interval is given on test errors.

h_quad, h_ratio, and h_shunt make the most of a few units; the best models always had quadratic interactions (K > 0).


Results: ShapeSet2

Error Rates (%)

Family  Units   K    Valid  Test
SVM     -       -    42.2   44 ± 2
sigm    200     -    36.6   36 ± 1
sigm2   500 ×2  -    24.0   24 ± 1
sigm3   500 ×3  -    22.3   22 ± 1
ratio   40      ×8   14.3   16 ± 1
shunt   80      ×8   24.7   26 ± 1

Table: Performance on ShapeSet2 for the best model in each family. Units is the hidden layer width (number of neurons). The best models used K components of higher-order interactions. Random guessing would score 66.7% error. A 95% confidence interval is given on test errors.


2nd-order translational invariance

Family  Units   K    Test        Quadratic?
SVM     -       -    29.3 ± 1.0  no
sigm    200     -    14.0 ± 1.0  no
quad    20      ×4   7.4 ± 0.8   yes
sigm2   200 ×2  -    7.2 ± 0.8   no
sigm3   200 ×3  -    5.9 ± 0.7   no
ratio   40      ×8   2.4 ± 0.4   yes
shunt   40      ×16  3.2 ± 0.5   yes

2nd-order models can be translation invariant.

Figure: two pixels x_i and x_j of the subject.

Use the slope between features of the subject (Reid et al., 1989).

x: pixelized image

W_ij = g(slope between x_i, x_j)

f(x) = a + ∑_ij W_ij x_i x_j

f(x) is non-trivial and independent of translation (not including border effects, cropping). f generalizes over subject position!
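A numerical check of this argument (the particular g and the L-shaped subject are invented for illustration): because W_ij depends only on the relative offset of pixels i and j, any translation that keeps the subject in frame leaves f(x) unchanged.

```python
import numpy as np

def pairwise_W(shape, g):
    """W_ij = g(direction from pixel i to pixel j): a function of the
    pair's relative offset only, never of absolute position."""
    rows, cols = np.indices(shape)
    r, c = rows.ravel().astype(float), cols.ravel().astype(float)
    angle = np.arctan2(r[None, :] - r[:, None], c[None, :] - c[:, None])
    return g(angle)

def f(x, W, a=0.0):
    v = x.ravel()
    return a + v @ W @ v

shape = (8, 8)
W = pairwise_W(shape, g=lambda t: np.cos(2 * t))   # illustrative g(slope)
img = np.zeros(shape)
img[1:4, 1] = 1.0                                  # small L-shaped subject
img[3, 1:4] = 1.0
shifted = np.roll(img, shift=(3, 2), axis=(0, 1))  # translate; stays in frame
print(f(img, W), f(shifted, W))                    # identical responses
```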


Gradual saturation

Family  Units   K    Test        ratio?
SVM     -       -    29.3 ± 1.0  no
sigm    200     -    14.0 ± 1.0  no
quad    20      ×4   7.4 ± 0.8   no
sigm2   200 ×2  -    7.2 ± 0.8   no
sigm3   200 ×3  -    5.9 ± 0.7   no
ratio   40      ×8   2.4 ± 0.4   yes
shunt   40      ×16  3.2 ± 0.5   yes

Ratio-based units: easier to optimize by gradient?

logistic gradient: d/dx [1/(1 + e^−x)] = e^−x / (1 + e^−x)²

ratio gradient: d/dx [x/(1 + x)] = (1 + x)^−2

lim_{x→∞} (d/dx [1/(1 + exp(−x))]) / (d/dx [x/(1 + x)]) = lim_{x→∞} exp(−x) · x² = 0

Figure: sigmoid(0.7x) and ratio(x) (left); their derivatives 0.7·sigmoid′(0.7x) and ratio′(x) (right).
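A quick numerical check of this limit (a sketch; dratio is the derivative of x/(1 + x), valid for x > 0, and the sample points are mine):

```python
import numpy as np

def dsigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                 # = e^-x / (1 + e^-x)^2

def dratio(x):
    return (1.0 + x) ** -2               # d/dx [x / (1 + x)], x > 0

for x in [1.0, 5.0, 10.0, 20.0]:
    q = dsigmoid(x) / dratio(x)          # ~ exp(-x) * x^2 -> 0
    print(f"x = {x:5.1f}   sigmoid' = {dsigmoid(x):.2e}   "
          f"ratio' = {dratio(x):.2e}   quotient = {q:.2e}")
# the sigmoid gradient vanishes exponentially; the ratio's only polynomially
```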


Future Work

Ongoing:

investigate translational invariance, gradient starvation

results for h_ratio, h_shunt models without second-order terms

other datasets, non-image data

Future: Deep Networks

system vs. cell

what neural (dendritic?) mechanisms underlie the Rust model?

try to stack many layers


Conclusions

• Machine Learning: new models

balance the flexibility of higher-order units with the efficiency of 1st-order units;

dramatic improvement on ShapeSet, competitive on MNIST

⇒ Brain models may yet inform Machine Learning.

• Comp. Neurosci.: an alternative to sigmoid(∑_j W_ij x_j) that is

computationally affordable,

biologically motivated,

and easier to learn.
