Efficient training in high-dimensional weight space
Theoretische Physik und Astrophysik
Computational Physics
Julius-Maximilians-Universität Würzburg
Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl
Wiskunde & Informatica
Intelligent Systems
Rijksuniversiteit Groningen, Postbus 800,
NL-9718 DD Groningen, The Netherlands
[email protected], www.cs.rug.nl/~biehl
Michael Biehl, Christoph Bunzmann, Robert Urbanczik
Efficient training in high-dimensional weight space
· Learning from examples
· A model situation: layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent, delayed learning, plateau states
· Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)
· Summary, outlook: selected further topics, prospective projects
Learning from examples
supervised learning:
the choice of adjustable parameters in adaptive information processing systems
· based on example data, e.g. input/output pairs in
  - classification tasks
  - time series prediction
  - regression problems
· parameterizes a hypothesis
e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate
objective or cost function
e.g. performance with respect to the example data
· results in generalization ability
e.g. the successful classification of novel data
Theory of learning processes
· description of specific applications, e.g. hand-written digit recognition
  - particular training scheme
  - given real-world problem
  - special set of example data ...
· typical properties of model scenarios, e.g. learning curves
  - network architecture
  - statistics of data, noise
  - learning algorithm
  → understanding/prediction of relevant phenomena, algorithm design
· general results, independent of
  - statistical properties of data
  - specific task
  - details of the training procedure ...
  e.g. performance bounds
trade-off: general validity / applicability
A two-layered network: the soft committee machine

  σ(ξ) = Σ_{k=1}^{K} g(w_k · ξ)

· input data ξ ∈ ℝ^N
· adaptive weights w_k ∈ ℝ^N of the hidden units k = 1, 2, ..., K
· sigmoidal hidden activation, e.g. g(x) = erf(a x)
· input/output relation ℝ^N → ℝ  (fixed hidden-to-output weights)
· SCM + adaptive thresholds: universal approximator
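A minimal numerical sketch of this architecture in NumPy; the function name scm_output, the gain a, and the random example weights are illustrative choices, not taken from the slides:

```python
import numpy as np
from scipy.special import erf

def g(x, a=1.0):
    """Sigmoidal hidden activation g(x) = erf(a x), as on the slide."""
    return erf(a * x)

def scm_output(W, xi, a=1.0):
    """Soft committee machine output sigma(xi) = sum_k g(w_k . xi).

    W  : (K, N) array, rows are the adaptive hidden weight vectors w_k
    xi : (N,) input vector
    The hidden-to-output weights are fixed to +1 (soft committee).
    """
    local_fields = W @ xi            # x_k = w_k . xi
    return g(local_fields, a).sum()

# tiny example: K = 3 hidden units, N = 100 input dimensions (illustrative values)
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 100)) / np.sqrt(100)
xi = rng.standard_normal(100)
print(scm_output(W, xi))
```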
Student-teacher scenario

teacher: (best) parameterization of the rule
  τ(ξ) = Σ_{m=1}^{M} g(w*_m · ξ),   M hidden units

adaptive student:
  σ(ξ) = Σ_{k=1}^{K} g(w_k · ξ),   K hidden units

relevant cases:
  M = K   ideal situation: perfectly matching complexity
  M > K   unlearnable rule
  M < K   over-sophisticated student
  → interesting effects
training based on the performance w.r.t. the example data, e.g.

  E = (1/P) Σ_{μ=1}^{P} e^μ   with   e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

input/output pairs:  ID = { ξ^μ, τ(ξ^μ) }_{μ=1}^{P}
(reliable) examples for the unknown function or rule τ(ξ)

evaluation after training:
generalization error  e_G = ⟨ e(ξ) ⟩_ξ
expected error for a novel input ξ ∉ ID,
w.r.t. the density of inputs / a set of test inputs
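As a sketch, the two error measures could be estimated as follows for given student and teacher weight matrices; the helper names and the choice g(x) = erf(x/√2) are assumptions for illustration:

```python
import numpy as np
from scipy.special import erf

def scm(W, Xi):
    """sigma(xi) = sum_k erf(w_k . xi / sqrt(2)) for every row xi of Xi."""
    return erf(Xi @ W.T / np.sqrt(2.0)).sum(axis=1)

def training_error(W_student, W_teacher, Xi_train):
    """E = (1/P) sum_mu (1/2) [sigma(xi^mu) - tau(xi^mu)]^2 on the example set ID."""
    diff = scm(W_student, Xi_train) - scm(W_teacher, Xi_train)
    return 0.5 * np.mean(diff ** 2)

def generalization_error(W_student, W_teacher, N, n_test=10_000, rng=None):
    """Monte Carlo estimate of e_G = <e(xi)>_xi over the isotropic input density."""
    rng = np.random.default_rng() if rng is None else rng
    Xi_test = rng.standard_normal((n_test, N))   # novel inputs: zero mean, unit variance
    diff = scm(W_student, Xi_test) - scm(W_teacher, Xi_test)
    return 0.5 * np.mean(diff ** 2)
```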
Statistical Physics approach
· consider large systems, in the thermodynamic limit N → ∞  (K, M << N)
  N: dimension of the input data ∝ number of adjustable parameters
· perform averages
  - over the stochastic training process ⟨...⟩_T
  - over randomized example data, quenched disorder ⟨...⟩_ID
  (technically) simplest case: reliable teacher outputs,
  isotropic input density: independent components with zero mean / unit variance
· evaluate typical properties
  e.g. the learning curve ⟨ e_G ⟩_{ID,T} vs. P
· description in terms of macroscopic quantities
  e.g. the overlap parameters  R_jm = w_j · w*_m,   Q_ij = w_i · w_j
  (student/teacher similarity measures)

next: e_G
The generalization error

  e_G = ⟨ e(ξ) ⟩_ξ = ½ ⟨ [ Σ_{k=1}^{K} g(x_k) − Σ_{m=1}^{M} g(x*_m) ]² ⟩_ξ

with the local fields  x_k = w_k · ξ,   x*_m = w*_m · ξ   (sums of many random numbers)

Central Limit Theorem: for large N the { x_k, x*_m } become correlated Gaussians with
first and second moments
  ⟨x_k⟩ = ⟨x*_m⟩ = 0
  ⟨x_k x_j⟩ = w_k · w_j = Q_kj
  ⟨x_k x*_m⟩ = w_k · w*_m = R_km
  ⟨x*_m x*_n⟩ = w*_m · w*_n = δ_mn   (orthonormal teacher vectors)

averages over ξ  →  Gaussian integrals over the x_k, x*_m
⇒  e_G = e_G( { R_jm, Q_jk } )

microscopic:  K N weights   →   macroscopic:  ½ (K² + K) + K M order parameters
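For the specific activation g(x) = erf(x/√2) the remaining Gaussian integrals are known in closed form, so e_G can be written directly in terms of the order parameters; a sketch under that assumption (the function name and the example values are mine):

```python
import numpy as np

def eg_from_order_parameters(Q, R, T):
    """e_G = 1/2 <(sigma - tau)^2> for g(x) = erf(x/sqrt(2)), expressed solely
    through the overlaps Q_jk = w_j.w_k, R_jm = w_j.w*_m, T_mn = w*_m.w*_n.
    Uses <g(x) g(y)> = (2/pi) arcsin( <xy> / sqrt((1+<x^2>)(1+<y^2>)) )
    for zero-mean correlated Gaussian fields x, y."""
    def term(C_xy, C_xx, C_yy):
        return np.arcsin(C_xy / np.sqrt((1.0 + C_xx) * (1.0 + C_yy)))

    qd, td = np.diag(Q), np.diag(T)
    ss = term(Q, qd[:, None], qd[None, :]).sum()   # student-student terms
    tt = term(T, td[:, None], td[None, :]).sum()   # teacher-teacher terms
    st = term(R, qd[:, None], td[None, :]).sum()   # student-teacher terms
    return (ss + tt - 2.0 * st) / np.pi

# example: orthonormal teacher, unspecialized student, K = M = 2 (illustrative numbers)
T = np.eye(2)
Q = np.array([[0.5, 0.2], [0.2, 0.5]])
R = np.full((2, 2), 0.3)
print(eg_from_order_parameters(Q, R, T))
```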
Dynamics of on-line gradient descent

presentation of single examples:
  w_k^(μ−1) : weights after presentation of (μ−1) examples
  novel, random example:  ( ξ^μ, τ(ξ^μ) ),   e^μ = ½ [ σ(ξ^μ) − τ(ξ^μ) ]²

on-line learning step:   w_k^μ = w_k^(μ−1) − (η/N) ∇_{w_k} e^μ

μ = number of examples = discrete learning time

practical advantages:
· no explicit storage of all examples ID required
· little computational effort per example
mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples
→ coupled ODEs for { R_jm, Q_ij } in the learning time α = P/(K N)

projections:   Q_jk(μ) = w_j^μ · w_k^μ,   R_km(μ) = w_k^μ · w*_m

recursions, e.g.
  R_km(μ) − R_km(μ−1) = (η/N) ( τ^μ − σ^μ ) g'(x_k^μ) x*_m^μ

large N:
· average over the latest example → Gaussian fields x_k^μ, x*_m^μ
· mean recursions → coupled ODEs in continuous time α = μ/(K N)
  (training time, ~ examples per weight)

learning curve:  Q_jk(α), R_km(α)  →  e_G(α)
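A compact simulation sketch of these on-line dynamics, recording the overlaps Q_jk(α), R_km(α) along a random sequence of examples; the default parameters loosely follow the K = M = 2 example below, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.special import erf

def online_gradient_descent(N=500, K=2, eta=1.5, alpha_max=100, seed=0):
    """On-line gradient descent for a soft committee student learning an
    orthonormal soft committee teacher (T_mn = delta_mn), with g(x) = erf(x/sqrt(2)).
    Update: w_k <- w_k + (eta/N) (tau - sigma) g'(x_k) xi."""
    rng = np.random.default_rng(seed)
    g  = lambda x: erf(x / np.sqrt(2.0))
    gp = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x ** 2)   # g'(x)

    Ws = rng.standard_normal((K, N)) / np.sqrt(N)   # random init: R_km(0) = O(1/sqrt(N))
    Wt = np.eye(K, N)                               # teacher vectors w*_m
    history = []
    for mu in range(int(alpha_max * K * N)):        # alpha = mu / (K N)
        xi = rng.standard_normal(N)                 # novel random example
        x, y = Ws @ xi, Wt @ xi                     # student / teacher local fields
        delta = g(y).sum() - g(x).sum()             # tau(xi) - sigma(xi)
        Ws += (eta / N) * delta * gp(x)[:, None] * xi[None, :]
        if mu % (K * N) == 0:                       # record once per unit of alpha
            history.append((mu / (K * N), Ws @ Wt.T, Ws @ Ws.T))
    return history                                  # list of (alpha, R, Q)
```

Feeding the recorded R and Q (with T = 1) into the order-parameter expression for e_G sketched earlier yields the corresponding learning curve e_G(α).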
learning curve

[Figure: learning curve e_G(α), α = P/(K N) up to ≈ 300, for the example K = M = 2, η = 1.5, R_ij(0) ≈ 0]
Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769

· fast initial decrease of e_G
· quasi-stationary plateau states with all R_ij ≈ R (unspecialized student weights) dominate the learning process
· finally perfect generalization:  e_G → 0
evolution of overlap parameters

[Figure: overlaps vs. α for the example K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0: the diagonal R_11, R_22, Q_11, Q_22 and the off-diagonal R_12, R_21, Q_12 = Q_21; sketch of student weights w_1, w_2 and teacher weights w*_1, w*_2]

permutation symmetry of the branches in the student network
Monte Carlo simulations: self-averaging

[Figure: mean and standard deviation of an overlap parameter Q_jm plotted vs. 1/N; the fluctuations vanish in the thermodynamic limit]
Plateau length

α_plat → ∞ if all R_jk(0) = R(0) exactly (self-avg.)

assume a randomized initialization of the weight vectors:
  R_jk(0) = O(1/√N)   →   α_plat ∝ ln N,
  i.e. P ∝ (K N) ln N examples are needed for successful learning!

hidden unit specialization (R_jj >> R_jm) with only P ∝ K N examples
requires a priori knowledge (initial macroscopic overlaps)

the plateau: property of the learning scenario, necessary phase of training
  or
artifact of the training prescription ???

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation: Theory, Architectures, and Applications
[Figure: test error E_test vs. training time t]
Training by Principal Component Analysis

problem: delayed specialization in the ( K N )-dimensional weight space

idea:
A) identification (approximation) of the subspace spanned by the w*_m
B) actual training within this low-dimensional space

modified correlation matrix:   C = ⟨ τ(ξ)² ξ ξᵀ ⟩,   C_ij = ⟨ τ(ξ)² ξ_i ξ_j ⟩

example: soft committee teacher (K = M), isotropic input density
eigenvalues and eigenvectors of C:
· 1 eigenvector  Δ_Σ ∝ Σ_{m=1}^{M} w*_m   with eigenvalue λ_Σ
· (K−1) eigenvectors  Δ_m ∝ w*_1 − w*_m  ( m = 2, 3, ..., M )   with eigenvalue λ_Δ
· (N−K) eigenvectors  u ⊥ w*_m   with eigenvalue λ_o,   λ_Δ < λ_o < λ_Σ

empirical estimate from a limited data set:
  C^P_ij = (1/P) Σ_{μ=1}^{P} τ(ξ^μ)² ξ^μ_i ξ^μ_j

A) determine from C^P
· the largest eigenvalue and its eigenvector Δ^P_1 (approximating the direction of Σ_m w*_m)
· the (K−1) smallest eigenvalues and their eigenvectors Δ^P_k  ( k = 2, ..., K )
note: the required memory ∝ N² does not increase with P

B) specialization in the K-dimensional space of the Δ^P_k
· representation of the student weights:  w_j = Σ_{k=1}^{K} a_jk Δ^P_k   ( j = 1, 2, ..., K )
· optimization of E w.r.t. the a_kj   ( K² << K N coefficients )
  ( # of examples P ∝ N K >> K² )
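A sketch of step A) in NumPy, using a full eigendecomposition of the empirical N × N matrix C^P; the function names and the orthonormal toy teacher are illustrative assumptions:

```python
import numpy as np
from scipy.special import erf

def empirical_correlation_matrix(Xi, tau):
    """C^P_ij = (1/P) sum_mu tau(xi^mu)^2 xi^mu_i xi^mu_j  (an N x N matrix)."""
    return (Xi * tau[:, None] ** 2).T @ Xi / len(tau)

def teacher_subspace(Xi, tau, K):
    """Step A): the eigenvector of the largest eigenvalue of C^P plus the
    eigenvectors of the (K-1) smallest eigenvalues, returned as a (K, N) array."""
    eigval, eigvec = np.linalg.eigh(empirical_correlation_matrix(Xi, tau))
    idx = list(range(K - 1)) + [-1]       # (K-1) smallest, then the largest
    return eigvec[:, idx].T

# illustration: isotropic inputs, orthonormal soft committee teacher with M = K = 3
rng = np.random.default_rng(1)
N, K = 200, 3
P = 5 * K * N                                       # P = alpha K N with alpha = 5
Wt = np.eye(K, N)                                   # teacher vectors w*_m
Xi = rng.standard_normal((P, N))
tau = erf(Xi @ Wt.T / np.sqrt(2.0)).sum(axis=1)     # teacher outputs tau(xi^mu)
Delta = teacher_subspace(Xi, tau, K)
print(Delta @ Wt.T)                                 # overlaps with the true w*_m
```

Step B) would then optimize the K² coefficients a_jk of w_j = Σ_k a_jk Δ^P_k by ordinary gradient descent on E, as described above.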
typical properties: given a random set ID of P = α K N examples

· formal partition sum   Z = ∫ dΔ exp( − β P Δᵀ C^P Δ )
· quenched free energy   ⟨ ln Z ⟩_ID   ~   replica trick,
  saddle point integration in the limit N → ∞
· typical overlap ρ of the resulting directions with the teacher weights w*_i:
  measures the success of the teacher space identification A)

B) given ρ, determine the optimal e_G achievable by a linear combination of the Δ_i
results: K = 3, Statistical Physics theory and Monte Carlo simulations, N = 400 and N = 1600 (•)

[Figure: A) overlap ρ of the estimated subspace with the teacher space and B) the resulting optimal e_G, both vs. α with P = α K N examples]

critical number of examples per weight:
  α_c (K=2) = 4.49,   α_c (K=3) = 8.70
  large-K theory:  α_c (K) ~ 2.94 K   (N-indep.!)

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)

unspecialized → specialized:
specialization without a priori knowledge  ( α_c independent of N )
spectrum of the matrix C^P, teacher with M = 7 hidden units

[Figure: eigenvalue spectrum of C^P; the K−1 = 6 smallest eigenvalues split off from the bulk]

· the algorithm requires no prior knowledge of M
· the PCA spectrum hints at the required model complexity
→ potential application: model selection
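A possible sketch of this model selection idea: estimate M from the spectrum of C^P by locating the largest gap among the small eigenvalues. The gap heuristic and the function name are my assumptions, not part of the slides:

```python
import numpy as np

def estimate_hidden_units(eigvals):
    """Guess the teacher size M from the spectrum of C^P:
    the (M-1) smallest eigenvalues split off below the bulk, so M-1 is taken
    to be the position of the largest gap in the lower half of the sorted spectrum."""
    lam = np.sort(np.asarray(eigvals))
    gaps = np.diff(lam[: len(lam) // 2])   # gaps among the small eigenvalues
    return int(np.argmax(gaps)) + 2        # (M-1) split-off eigenvalues -> M
```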
Summary
· model situation, supervised learning
  - the soft committee machine
  - student-teacher scenario
  - randomized training data
· dynamics of on-line gradient descent
  - delayed learning due to symmetry breaking
  - necessary specialization processes
· statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties
· efficient training
  - PCA-based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge
Further topics
· perceptron training (single layer): optimal stability, classification, dynamics of learning
· unsupervised learning: principal component analysis, competitive learning, clustered data
· specialization processes: discontinuous learning curves, delayed learning, plateau states
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design: variational method, optimal algorithms, construction algorithms
· non-trivial statistics of data: learning from noisy data, time-dependent rules
Selected Prospective Projects
· unsupervised learning
  - density estimation, feature detection, clustering
  - (Learning) Vector Quantization, compression, self-organizing maps
· application relevant architectures and algorithms
  - Local Linear Model Trees
  - Learning Vector Quantization
  - Support Vector Machines
· model selection
  - estimate the complexity of a rule or of a mixture density
· algorithm design
  - variational optimization, e.g. of an alternative correlation matrix
    C_ij = (1/P) Σ_{μ=1}^{P} F(τ(ξ^μ)) ξ^μ_i ξ^μ_j
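As a sketch of this variational idea, the empirical correlation matrix with a general weighting function F of the teacher output; F(τ) = τ² recovers the matrix C^P used above (the function signature is an illustrative assumption):

```python
import numpy as np

def weighted_correlation_matrix(Xi, tau, F=np.square):
    """C_ij = (1/P) sum_mu F(tau(xi^mu)) xi^mu_i xi^mu_j.
    F(t) = t^2 recovers the matrix C^P used above; the variational task would be
    to choose F such that the teacher subspace is identified from as few examples
    as possible."""
    return (Xi * F(tau)[:, None]).T @ Xi / len(tau)
```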