
Herding: The Nonlinear Dynamics of Learning

Max Welling, SCIVI LAB - UC Irvine

Yes, All Models Are Wrong, but… from a CS/ML perspective this may not necessarily be a big problem.

• Training: We want to gain an optimal amount of predictive accuracy per unit time.

• Testing: We want to engage the model that results in optimal accuracy within the time allowed to make a decision.

• Computer scientists are mostly interested in prediction. Example: ML researchers do not care about identifiability (as long as the model predicts well).

• Computer scientists care a lot about computation. Example: ML researchers are willing to trade off estimation bias for computation (if this means we can handle bigger datasets – e.g. variational inference vs. MCMC).

Fight or flight

Not Bayesian Nor Frequentist

But Mandelbrotist…

Is there a deep connection between learning, computation and chaos theory?

Perspective

[Figure: two routes from data $x_n \sim P$ to a prediction $E_p[f] = \int f(x)\,p(x)\,dx \approx \frac{1}{N}\sum_n f(x_n)$. Route 1: data → learning → model / inductive bias $p(x)$ → inference → integration/prediction. Route 2: data → herding → pseudo-samples → integration/prediction.]

Herding

Nonlinear dynamical system: generate pseudo-samples "S".

Consistency: $\bar f(S_{\text{herding}}) = \hat E_P[f]$

Prediction: $g(S_{\text{herding}})$

Herding

$S^* = \arg\max_S \sum_k W_k\, f_k(S)$

$W_k \leftarrow W_k + \hat E_P[f_k] - f_k(S^*)$

• Weights do not converge, but Monte Carlo sums do.
• Maximization does not have to be perfect (see the PCT theorem).
• Deterministic.
• No step-size.
• Only very simple operations (no exponentiation, logarithms, etc.).
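For concreteness, here is a minimal sketch of that loop in NumPy (my illustration, not code from the talk); the candidate states, features, and target moments are toy values I made up. Note that it is deterministic, has no step-size, and uses only comparisons and additions, in line with the bullets above.

```python
# A minimal herding loop, sketched in NumPy (illustration only).
import numpy as np

def herd(F, target, T=1000):
    """F: (S, K) matrix of feature values f_k(s) for each candidate state s.
    target: (K,) vector of moments E_P[f_k]. Returns the itinerary of states."""
    w = target.copy()                 # initialize weights (any start works)
    itinerary = []
    for _ in range(T):
        s = int(np.argmax(F @ w))     # S_t = argmax_S sum_k W_k f_k(S)
        itinerary.append(s)
        w += target - F[s]            # W_k <- W_k + E_P[f_k] - f_k(S_t)
    return np.array(itinerary)

# Toy example: states are the corners of {0,1}^2, features (x1, x2, x1*x2).
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
F = np.c_[states[:, 0], states[:, 1], states[:, 0] * states[:, 1]]
target = np.array([0.6, 0.7, 0.4])    # assumed (realizable) target moments
itinerary = herd(F, target, T=5000)
print(F[itinerary].mean(axis=0))      # pseudo-sample averages approach target
```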

Ising/Hopfield Model Network

Objective: $\sum_k w_k\, f_k(S) = \sum_{ij} w_{ij}\, s_i s_j + \sum_i w_i\, s_i$

Maximization (coordinate-wise): $s_i^* = \operatorname{sign}\!\Big(\sum_j w_{ij}\, s_j^* + w_i\Big)$

Weight updates:
$W_{ij} \leftarrow W_{ij} + \hat E_P[s_i s_j] - s_i^*\, s_j^*$
$W_i \leftarrow W_i + \hat E_P[s_i] - s_i^*$

Neuron fires if input exceeds threshold

Synapse depresses if pre- & postsynaptic neurons fire.

Threshold depresses after neuron fires
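A rough sketch of how these three rules might look in code, assuming a greedy coordinate-wise maximization (per the note above, the maximization need not be perfect) and randomly made-up target moments; if those moments are not jointly realizable, the pseudo-sample averages converge to their projection onto the marginal polytope instead.

```python
# Sketch of herding in an Ising/Hopfield network (illustration only; the
# target moments below are random and may not be jointly realizable).
import numpy as np

rng = np.random.default_rng(0)
n = 8                                        # binary units s_i in {-1, +1}
E_s  = rng.uniform(-0.5, 0.5, size=n)        # assumed target moments E_P[s_i]
E_ss = rng.uniform(-0.3, 0.3, size=(n, n))   # assumed target moments E_P[s_i s_j]
E_ss = (E_ss + E_ss.T) / 2
np.fill_diagonal(E_ss, 0.0)

W = E_ss.copy()                              # "synapses"   W_ij
b = E_s.copy()                               # "thresholds" W_i
s = rng.choice([-1.0, 1.0], size=n)
avg_ss, avg_s, T = np.zeros((n, n)), np.zeros(n), 2000

for t in range(T):
    # Approximate maximization of sum_ij W_ij s_i s_j + sum_i W_i s_i by a few
    # greedy sweeps: neuron i fires (s_i = +1) when its net input is positive.
    for _ in range(5):
        for i in range(n):
            s[i] = 1.0 if W[i] @ s + b[i] > 0 else -1.0
    # Synapse "depresses" when pre- and postsynaptic neurons both fire;
    # threshold "depresses" after a neuron fires (moment-matching updates):
    W += E_ss - np.outer(s, s)
    np.fill_diagonal(W, 0.0)
    b += E_s - s
    avg_ss += np.outer(s, s) / T
    avg_s  += s / T

off = ~np.eye(n, dtype=bool)                 # ignore the trivial diagonal
print(np.abs(avg_s - E_s).max(), np.abs((avg_ss - E_ss)[off]).max())
```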

Pseudo-Samples From Critical Ising Model

Herding as a Dynamical System

Weight map: $W_{k,t+1} = F_k(W_t) = W_{k,t} + \hat E_P[f_k] - f_k\big(S_t(W_t)\big)$, with $S_t(W_t) = \arg\max_S \sum_k W_{k,t}\, f_k(S)$.
The term $\hat E_P[f_k]$ is constant and $f_k\big(S_t(W_t)\big)$ is a piecewise constant function of $W$: a Markov process in $W$.

Sample map: $S_t = G(S_1, S_2, \ldots, S_{t-1}) = \arg\max_S \sum_k W_{k,t}\, f_k(S)$, with $W_{k,t} = W_{k,0} + t\,\hat E_P[f_k] - \sum_{i=1}^{t-1} f_k(S_i)$: an infinite memory process in $S$.

Example in 2-D

[Figure: the 2-D weight space is partitioned into regions labeled s = 1, …, 6, one per maximizing state; the weights follow $\vec W_{t+1} = \vec W_t + \hat E_P[\vec f\,] - \vec f(S_t)$ and trace out the itinerary s = [1, 1, 2, 5, 2, ...].]

Convergence

Translation: choose $S_t$ such that:

$\sum_k W_{k,t}\,\big(\hat E_P[f_k] - f_k(S_t)\big) \le 0$

Then:

$\Big|\frac{1}{T}\sum_{t=1}^{T} f_k(s_t) - \hat E_P[f_k]\Big| \sim O\!\Big(\frac{1}{T}\Big)$


Equivalent to the “Perceptron Cycling Theorem” (Minsky ’68).
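As a quick numerical companion (my toy setup, not from the talk): the herding moment error shrinks roughly like 1/T, while IID sampling shrinks like 1/√T.

```python
# Toy check of herding's O(1/T) moment error versus IID sampling's O(1/sqrt(T)).
import numpy as np

states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
F = np.c_[states[:, 0], states[:, 1], states[:, 0] * states[:, 1]]
p = np.array([0.1, 0.3, 0.2, 0.4])          # assumed distribution over the 4 states
target = p @ F                              # exact moments E_P[f_k]

def herd(F, target, T):
    w, out = target.copy(), []
    for _ in range(T):
        s = int(np.argmax(F @ w)); out.append(s); w += target - F[s]
    return np.array(out)

rng = np.random.default_rng(1)
for T in [10, 100, 1000, 10000]:
    herd_err = np.abs(F[herd(F, target, T)].mean(axis=0) - target).max()
    iid_err  = np.abs(F[rng.choice(4, size=T, p=p)].mean(axis=0) - target).max()
    print(f"T={T:6d}  herding error {herd_err:.5f}   iid error {iid_err:.5f}")
```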

Period Doubling

$W_{t+1} = R\, W_t\,(1 - W_t)$

As we change R (analogously, T), the number of fixed points changes.

$W_{k,t+1} = W_{k,t} + \hat f_k - \dfrac{\sum_x f_k(x)\,\exp\!\big(\sum_{k'} W_{k',t}\, f_{k'}(x)\,/\,T\big)}{\sum_x \exp\!\big(\sum_{k'} W_{k',t}\, f_{k'}(x)\,/\,T\big)}$

T=0: herding

“edge of chaos”
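Reading the tempered update literally for a finite state space, the zero-temperature argmax is replaced by an expectation under a softmax over states. Below is a small sketch with toy features and moments of my own choosing; as T → 0 the softmax concentrates on the argmax state and the T = 0 herding step is recovered.

```python
# Sketch of the finite-temperature update: W <- W + f_hat - E_softmax[f].
import numpy as np

def tempered_step(w, F, target, temp):
    """F: (S, K) features over states; w: (K,) weights; temp: temperature T."""
    logits = (F @ w) / temp
    logits -= logits.max()              # stabilize the exponentials
    q = np.exp(logits)
    q /= q.sum()                        # softmax distribution over states
    return w + target - q @ F           # subtract the softmax-averaged features

F = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
target = np.array([0.6, 0.4])           # assumed target moments
for temp in [1.0, 0.1, 1e-6]:           # cooling toward the herding limit
    w = target.copy()
    for _ in range(3):
        w = tempered_step(w, F, target, temp)
    print(temp, w)
```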

Applications

• Classification

• Compression

• Modeling Default Swaps

• Monte Carlo Integration

• Image Segmentation

• Natural Language Processing

• Social Networks

Example: Classifier from local image features:

P(Object Category | Local Image Information)

Classifier from boundary detection:

P(Object Categories are Different across Boundary | Boundary Information)

Combine the two with herding: herding will generate samples such that the local probabilities are respected as much as possible (a projection onto the marginal polytope).

Topological Entropy

Theorem [Goetz00]: Call W(T) the number of possible subsequences of length T (e.g. S = 1, 3, 2 is a subsequence of length 3); then the topological entropy for herding is:

$h_{top} = \lim_{T\to\infty} \frac{\log W(T)}{T} \le \lim_{T\to\infty} \frac{K\,\log(T)}{T} = 0$

However, we are interested in the sub-extensive entropy [Nemenman et al.]:

$h^{sub}_{top} = \lim_{T\to\infty} \frac{\log W(T)}{\log(T)} \le \lim_{T\to\infty} \frac{K\,\log(T)}{\log(T)} = K$

Theorem: $h^{sub}_{top} \le K$ (K = nr. of parameters)

Conjecture: $h^{sub}_{top} = K$ (for typical herding systems)
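As a rough empirical companion to the conjecture (my toy sketch, not the construction used in the theorem): count the distinct length-T windows in a herding itinerary and look at log W(T)/log T; finite-run counts only loosely track the limit.

```python
# Count distinct length-T subsequences W(T) of a herding itinerary and
# report log W(T) / log T, which the conjecture says should approach K.
import numpy as np

def herd_itinerary(F, target, steps):
    w, out = target.copy(), []
    for _ in range(steps):
        s = int(np.argmax(F @ w)); out.append(s); w += target - F[s]
    return out

F = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # 4 states, K = 2 features
target = np.array([0.37, 0.61])                          # assumed target moments
seq = herd_itinerary(F, target, 100000)

for T in [2, 4, 8, 16, 32, 64]:
    W_T = len({tuple(seq[i:i + T]) for i in range(len(seq) - T + 1)})
    print(f"T={T:3d}  W(T)={W_T:6d}  log W(T)/log T = {np.log(W_T)/np.log(T):.2f}")
```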

Learning Systems

Bayesian Evidence: $-\log P(X) \sim N\,H[p(x\,|\,X)]$ (extensive terms) $+ \frac{K}{2}\log(N)$

Herding is not random and not IID due to negative auto-correlations. The information in its sequence is $\sim K\log(N)$: the information we learn from the random IID data.

We can therefore represent the original (random) data sample by a much smaller subset without loss of information content (N instead of N² samples).

These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums.

Conclusions

• Herding is an efficient alternative for learning in MRFs.

• Edge of chaos dynamics provides more efficient information processing than random sampling.

• General principle that underlies information processing in the brain?

• We advocate exploring potentially interesting connections between computation, learning, and the theory of nonlinear dynamical systems and chaos. What can we learn from viewing learning as a nonlinear dynamical process?