Image Classification with Neurally Motivated Higher-Order Models
James Bergstra, Yoshua Bengio, Jerome Louradour
April 2008
James Bergstra, Yoshua Bengio, Jerome Louradour Image Classification with . . . Higher-Order Models
Summary
Neuroscience → Activation Function:
(Hubel & Wiesel) → logistic sigmoid(b + ∑_i W_i x_i)
(Rust et al., 2005) → this work
Models inspired by (Rust et al., 2005):
learn salient pairwise features of the input;
perform well on image classification tasks;
are computationally cheap;
motivate more accurate neural modelling.
Q. Why do they work? When?
A. We don’t know yet. Conjectures, future work.
Logistic Sigmoid
response = sigm(w · x + b) = 1 / (1 + exp(−w · x − b))
approximation of V1 simple cell firing rate
computationally cheap
differentiable → learnable
two layer nets (possibly fat) are universal approximators
treats each x_i independently → no XOR without layers
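As a minimal numpy sketch of the unit above (function names are my own, values illustrative):

```python
import numpy as np

def sigm(z):
    # logistic sigmoid: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def unit_response(w, x, b):
    # response = sigm(w . x + b)
    return sigm(np.dot(w, x) + b)

x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
print(unit_response(w, x, 0.0))  # w.x = 0, so sigm(0) = 0.5
```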
(Rust et al., 2005) V1 model
Rust, N., Schwartz, O., Movshon, J. A., & Simoncelli, E. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46, pp. 945–956.
response(E, S) = (α + βE^ζ − δS^ζ) / (1 + γE^ζ + εS^ζ)

E = √(max(0, w′x)² + x′V′Vx),  S = √(x′U′Ux)
separate positive scalar Excitation and Shunting inhibition
better approximation of V1 simple and complex cells
system-level account, not a neuron model
low rank V, U → computationally [almost as] cheap
can capture non-linear dependencies among a neuron’s inputs
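The response above can be sketched directly in numpy. This is an illustration under the slide's definitions, not the authors' code; the coefficient values and input sizes are arbitrary:

```python
import numpy as np

def rust_response(x, w, V, U, alpha=0.0, beta=1.0, delta=1.0,
                  gamma=1.0, eps=1.0, zeta=1.0):
    # E: excitation, S: shunting inhibition, per the slide's formulas
    E = np.sqrt(max(0.0, w @ x) ** 2 + x @ V.T @ V @ x)
    S = np.sqrt(x @ U.T @ U @ x)
    return ((alpha + beta * E**zeta - delta * S**zeta)
            / (1 + gamma * E**zeta + eps * S**zeta))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(16)
V = rng.standard_normal((2, 16))  # low rank keeps the quadratic terms cheap
U = rng.standard_normal((2, 16))
print(rust_response(x, w, V, U))
```

With α = 0 and β = δ = γ = ε = ζ = 1 this reduces to (E − S)/(1 + E + S), which stays in (−1, 1) because E, S ≥ 0.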
Minsky and Papert’s HPU
hpu(x) = act(b + ∑_i Φ_i(x_i) + ∑_{ij} Φ_{ij}(x_i, x_j) + ∑_{ijk} Φ_{ijk}(x_i, x_j, x_k) + ···)
order > 1 → capable of XOR
computationally expensive: O(inputsize^order)
many parameters to learn for higher-orders
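For instance, a single second-order unit can compute XOR, which no first-order unit can. A minimal sketch (the specific coefficients are mine):

```python
# XOR with one second-order unit: the pairwise term
# Phi_12(x1, x2) = -2*x1*x2 added to the linear part.
def hpu2(x1, x2):
    return x1 + x2 - 2 * x1 * x2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, hpu2(x1, x2))  # matches x1 XOR x2
```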
Our quadratic models
h_shunt,i(x) = (E_i(x) − S_i(x)) / (1 + E_i(x) + S_i(x))

h_ratio,i(x) = E_i(x) / (1 + E_i(x))

h_quad,i(x) = sigmoid(b_i + w_i′x + x′V_i′V_i x − x′U_i′U_i x)

E_i(x) = √(x′V_i′V_i x + log[1 + exp(w_i · x)]²),  S_i(x) = √(x′U_i′U_i x)
log(1 + exp(z)): differentiable version of max(0, z)
Rust model exponents ζ fixed to 1 for learning
Rust model coefficients β, δ, γ, ε fixed to 1
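The three unit families above can be sketched as follows. This is a direct transcription of the slide's formulas into numpy, not the authors' implementation; the random parameters are illustrative:

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)): differentiable version of max(0, z)
    return np.log1p(np.exp(z))

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def E(x, w, V):
    return np.sqrt(x @ V.T @ V @ x + softplus(w @ x) ** 2)

def S(x, U):
    return np.sqrt(x @ U.T @ U @ x)

def h_shunt(x, w, V, U):
    e, s = E(x, w, V), S(x, U)
    return (e - s) / (1 + e + s)

def h_ratio(x, w, V):
    e = E(x, w, V)
    return e / (1 + e)

def h_quad(x, b, w, V, U):
    return sigm(b + w @ x + x @ V.T @ V @ x - x @ U.T @ U @ x)

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
w, b = rng.standard_normal(8), 0.0
V, U = rng.standard_normal((2, 8)), rng.standard_normal((2, 8))
print(h_shunt(x, w, V, U), h_ratio(x, w, V), h_quad(x, b, w, V, U))
```

Since E, S ≥ 0, h_shunt lies in (−1, 1), h_ratio in [0, 1), and h_quad in (0, 1).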
Different Squashing Functions
[Figure: left panel plots sigmoid(0.7·x) and ratio(x) for x ∈ [−10, 10]; right panel plots log(sigmoid(0.7·x)) and log(ratio(x)) for x ∈ [0, 20].]
sigmoid(x) = 1 / (1 + e^(−x))
ratio(x) = log(1 + e^x) / (1 + log(1 + e^x))
Classifying Features
Linear output classifier:
f(x) = b_y + W_y h_model(x)
Softmax for predicted class distribution:
p(class i | x) = e^{f_i(x)} / ∑_j e^{f_j(x)}
Negative log likelihood training criterion:
Loss(x, target) = −log p(target | x)
Learn W_y, b_y, h_model by gradient descent with an early-stopping heuristic.
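The output layer above can be sketched as follows (a minimal numpy illustration; the shapes and names are my own):

```python
import numpy as np

def predict_logits(h, Wy, by):
    # linear output classifier: f(x) = by + Wy h_model(x)
    return by + Wy @ h

def softmax(f):
    e = np.exp(f - f.max())  # shift for numerical stability
    return e / e.sum()

def nll_loss(h, Wy, by, target):
    # negative log likelihood of the target class
    p = softmax(predict_logits(h, Wy, by))
    return -np.log(p[target])

rng = np.random.default_rng(2)
h = rng.standard_normal(4)        # hidden features h_model(x)
Wy = rng.standard_normal((3, 4))  # 3 classes
by = np.zeros(3)
print(nll_loss(h, Wy, by, target=0))
```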
Results: MNIST
Error Rates (%)
Family  Units    K    Valid  Test
sigm1            -    1.8    1.9
sigm3            -    2.1    2.4
SVM     -        -    -      1.4
quad    80       ×8   1.8    1.9
ratio   320      ×8   1.2    1.4
shunt   40       ×8   1.4    1.5
Table: MNIST results for the best model in each family. Units is the hidden layer width (number of neurons). K, selected on the validation set, is the number of components of higher-order interactions. Random guessing would score 90% error.
Datasets ShapeSet1, ShapeSet2
Figure: Sample inputs from ShapeSet1 (top) and ShapeSet2 (bottom).
Ten thousand images for training. (5K valid, 5K test)
Motivated by Baby-AI project.
Why not scale and centre? Doesn’t scale to complex scenes!
Results: ShapeSet1
Error Rates (%)
Family  Units    K    Valid  Test
SVM     -        -    29.6   29.3 ± 1.0
sigm    200      -    13.5   14.0 ± 1.0
sigm2   200 ×2   -    6.8    7.2 ± 0.8
sigm3   200 ×3   -    5.4    5.9 ± 0.7
quad    20       ×4   6.3    7.4 ± 0.8
ratio   40       ×8   2.1    2.4 ± 0.4
shunt   40       ×16  2.9    3.2 ± 0.5
Table: Performance on ShapeSet1 for the best model in each family. Units is the hidden layer width (number of neurons). The best models used K components of higher-order interactions. Random guessing would score 66.7% error. A 95% confidence interval is given on test errors.
h_quad, h_ratio, h_shunt make the most of a few units
best models always had quadratic interactions (K > 0)
Results: ShapeSet2
Error Rates (%)
Family  Units    K    Valid  Test
SVM     -        -    42.2   44 ± 2
sigm    200      -    36.6   36 ± 1
sigm2   500 ×2   -    24.0   24 ± 1
sigm3   500 ×3   -    22.3   22 ± 1
ratio   40       ×8   14.3   16 ± 1
shunt   80       ×8   24.7   26 ± 1
Table: Performance on ShapeSet2 for the best model in each family. Units is the hidden layer width (number of neurons). The best models used K components of higher-order interactions. Random guessing would score 66.7% error. A 95% confidence interval is given on test errors.
2nd -order translational invariance
Family  Units    K    Test        Quadratic?
SVM     -        -    29.3 ± 1.0  no
sigm    200      -    14.0 ± 1.0  no
quad    20       ×4   7.4 ± 0.8   yes
sigm2   200 ×2   -    7.2 ± 0.8   no
sigm3   200 ×3   -    5.9 ± 0.7   no
ratio   40       ×8   2.4 ± 0.4   yes
shunt   40       ×16  3.2 ± 0.5   yes
2nd-order models can be translation invariant.
[Figure: two features x_i, x_j of the subject.] Use the slope between features of the subject (Reid et al., 1989).
x: pixelized image
W_ij = g(slope between x_i, x_j)
f(x) = a + ∑_{ij} W_ij x_i x_j
f(x) is non-trivial yet independent of translation (not counting border effects, cropping). f generalizes over subject position!
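A small numpy sketch of this invariance. Everything here is my own illustration: I use the angle between pixel positions (via `arctan2`) rather than the raw slope to avoid division by zero, and an arbitrary g; the invariance holds because W_ij depends only on the relative offset of pixels i and j:

```python
import numpy as np

def pairwise_feature(img, g):
    # f(x) = sum_ij g(direction between pixels i, j) * x_i * x_j;
    # the weight depends only on relative position, so a pattern that
    # stays fully inside the image gives the same f after translation.
    H, W = img.shape
    pos = np.array([(r, c) for r in range(H) for c in range(W)])
    x = img.ravel()
    f = 0.0
    for i in range(len(x)):
        for j in range(len(x)):
            if i == j:
                continue
            dr, dc = pos[j] - pos[i]
            f += g(np.arctan2(dr, dc)) * x[i] * x[j]
    return f

g = np.cos  # arbitrary function of the pairwise direction
a = np.zeros((6, 6)); a[1, 1] = a[2, 3] = 1.0
b = np.zeros((6, 6)); b[3, 2] = b[4, 4] = 1.0  # same pattern, shifted by (2, 1)
print(np.isclose(pairwise_feature(a, g), pairwise_feature(b, g)))  # True
```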
Gradual saturation

Family  Units    K    Test        ratio?
SVM     -        -    29.3 ± 1.0  no
sigm    200      -    14.0 ± 1.0  no
quad    20       ×4   7.4 ± 0.8   no
sigm2   200 ×2   -    7.2 ± 0.8   no
sigm3   200 ×3   -    5.9 ± 0.7   no
ratio   40       ×8   2.4 ± 0.4   yes
shunt   40       ×16  3.2 ± 0.5   yes
Ratio-based units easier to optimize by gradient?
logistic gradient: d/dx (1/(1 + e^(−x))) = e^(−x) / (1 + e^(−x))²

ratio gradient: d/dx (x/(1 + x)) = (1 + x)^(−2)

lim_{x→∞} [d/dx (1/(1 + exp(−x)))] / [d/dx (x/(1 + x))] = lim_{x→∞} exp(−x)(1 + x)² = 0
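A quick numeric check of this limit (a sketch with my own function names, using the x/(1+x) simplification of the ratio unit for x ≥ 0):

```python
import numpy as np

def dsigmoid(z):
    # d/dz of 1/(1 + e^{-z}) = s(z) * (1 - s(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def dratio(z):
    # d/dz of z/(1 + z), for z >= 0
    return 1.0 / (1.0 + z) ** 2

# The logistic gradient vanishes exponentially, the ratio gradient
# only polynomially, so their quotient shrinks toward 0.
for z in (5.0, 10.0, 20.0):
    print(z, dsigmoid(z) / dratio(z))
```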
[Figure: left panel plots sigmoid(0.7·x) and ratio(x) for x ∈ [−10, 10]; right panel plots their derivatives, 0.7·dsigmoid(0.7·x) and dratio(x).]
Future Work
Ongoing:
investigate translational invariance, gradient starvation
results for hratio , hshunt models without second-order terms
other datasets, non-image data
Future: Deep Networks
system vs. cell
what neural (dendritic?) mechanisms underlie the Rust model?
try to stack many layers
Conclusions
• Machine Learning: new models
balance the flexibility of higher-order units with the efficiency of 1st-order units;
dramatic improvement on ShapeSet, competitive on MNIST
⇒ Brain models may yet inform Machine Learning.
• Comp. Neurosci.: alternative to sigmoid(∑_j W_ij x_j)
is computationally affordable,
biologically motivated,
easier to learn.