
Hopfield networks and Boltzmann machines

Geoffrey Hinton et al.
Presented by Tambet Matiisen

18.11.2014

Hopfield network

• Binary units
• Symmetrical connections

http://www.nnwj.de/hopfield-net.html

Energy function

• The global energy:

  E = −Σᵢ sᵢbᵢ − Σ_{i<j} sᵢsⱼwᵢⱼ

• The energy gap:

  ΔEᵢ = E(sᵢ = 0) − E(sᵢ = 1) = bᵢ + Σⱼ sⱼwᵢⱼ

• Update rule:

  sᵢ = 1 if bᵢ + Σⱼ sⱼwᵢⱼ ≥ 0, otherwise sᵢ = 0

http://en.wikipedia.org/wiki/Hopfield_network
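A minimal sketch of these three quantities in code (the weight matrix is assumed symmetric with a zero diagonal; all names are illustrative, not from the slides):

```python
import numpy as np

def global_energy(s, b, w):
    """E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij (w symmetric, zero diagonal)."""
    return -s @ b - 0.5 * s @ w @ s

def energy_gap(i, s, b, w):
    """dE_i = E(s_i = 0) - E(s_i = 1) = b_i + sum_{j != i} s_j w_ij."""
    return b[i] + w[i] @ s - w[i, i] * s[i]

def update(i, s, b, w):
    """Binary threshold update: turn unit i on iff its energy gap is non-negative."""
    s[i] = 1 if energy_gap(i, s, b, w) >= 0 else 0
    return s
```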

Example

[Figure: a small Hopfield net with weights +3, +2, +3, +3, −1, −4, −1. Settling units one at a time, the configurations shown have −E (goodness) of 3 and 4.]

Deeper energy minimum

[Figure: the same net has a second, deeper energy minimum in which −E (goodness) = 5.]

Is updating of a Hopfield network deterministic or non-deterministic?

A. Deterministic
B. Non-deterministic

How to update?

• Nodes must be updated sequentially, usually in randomized order.
• With parallel updating, the energy could go up.
• If updates occur in parallel but with random timing, the oscillations are usually destroyed.

[Figure: two units, each with bias +5, joined by a weight of −100, both starting in state 0; updating them in parallel makes the pair oscillate.]

Content-addressable memory

• Using energy minima to represent memories gives a content-addressable memory.
  – An item can be accessed by just knowing part of its content.
  – Can fill out missing or corrupted pieces of information.
  – It is robust against hardware damage.

Classical conditioning

http://changecom.wordpress.com/2013/01/03/classical-conditioning/

Storing memories

• Energy landscape is determined by weights!

• If we use activities −1 and 1:

  Δwᵢⱼ = sᵢsⱼ

  (if sᵢsⱼ = 1 then wᵢⱼ is incremented; if sᵢsⱼ = −1 then wᵢⱼ is decremented)

• If we use states 0 and 1:

  Δwᵢⱼ = 4(sᵢ − ½)(sⱼ − ½)
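A minimal sketch of the ±1 storage rule and asynchronous recall described above, which also shows the content-addressable behaviour from the earlier slide (unit count, patterns, and helper names are illustrative):

```python
import numpy as np

def store(patterns):
    """Hebbian storage: increment w_ij by s_i * s_j for each pattern (activities -1/+1)."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)
    np.fill_diagonal(w, 0)              # no self-connections
    return w

def recall(w, state, steps=200, rng=np.random.default_rng(0)):
    """Asynchronous binary threshold updates in random order; each update never raises the energy."""
    state = state.copy()
    for _ in range(steps):
        i = rng.integers(len(state))
        state[i] = 1 if w[i] @ state >= 0 else -1
    return state

# Content-addressable recall: corrupt part of a stored pattern and let the net settle.
patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
w = store(patterns)
probe = patterns[0].copy()
probe[:2] = -probe[:2]                  # flip two bits
print(recall(w, probe))                 # settles back to patterns[0]
```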

Demo

• http://www.tarkvaralabor.ee/doodler/
• (choose Algorithm: Hopfield and Initialize)

How many weights did the example have?

A. 100
B. 1000
C. 10000

Storage capacity

• The capacity of a totally connected net with N units is only about 0.15 * N memories.
  – With N bits per memory this is only 0.15 * N * N bits.
• The net has N² weights and biases.
• After storing M memories, each connection weight has an integer value in the range [−M, M].
• So the number of bits required to store the weights and biases is N² log(2M + 1).
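A quick back-of-the-envelope check of these formulas (the values of N and M below are assumptions chosen for illustration, not taken from the demo):

```python
import math

N, M = 100, 15                               # assumed: 100 units, 15 stored memories
capacity = 0.15 * N                          # ~ number of memories the net can hold
stored_bits = 0.15 * N * N                   # information actually stored (N bits per memory)
weights = N * N                              # number of weights and biases (order N^2)
weight_bits = N**2 * math.log2(2 * M + 1)    # bits needed for integer weights in [-M, M]

print(f"capacity ~ {capacity:.0f} memories, {stored_bits:.0f} bits stored")
print(f"{weights} weights, ~ {weight_bits:.0f} bits to represent them")
```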

How many bits are needed to represent the weights in the example?

A. 1500
B. 50 000
C. 320 000

Spurious minima

• Each time we memorize a configuration, we hope to create a new energy minimum.
• But what if two minima merge to create a minimum at an intermediate location?

Reverse learning

• Let the net settle from a random initial state and then do unlearning.
• This will get rid of deep, spurious minima and increase memory capacity.

Increasing memory capacity

• Instead of trying to store vectors in one shot, cycle through the training set many times.
• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector:

  x̂ᵢ = f(Σⱼ xⱼwᵢⱼ),   Δwᵢⱼ = (xᵢ − x̂ᵢ)xⱼ
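A rough sketch of the procedure described above: cycle through the training vectors and nudge each unit's incoming weights toward predicting that unit's state from all the others (learning rate, epoch count, and symmetrization step are assumptions of this sketch):

```python
import numpy as np

def train_perceptron_style(patterns, epochs=50, lr=0.1):
    """Cycle through the training set; for each unit, move its weights toward
    reproducing the correct state given the states of all other units."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    b = np.zeros(n)
    for _ in range(epochs):
        for x in patterns:                      # x has activities -1/+1
            for i in range(n):
                pred = 1 if b[i] + w[i] @ x - w[i, i] * x[i] >= 0 else -1
                err = x[i] - pred               # perceptron-style error for unit i
                w[i] += lr * err * x
                w[i, i] = 0                     # keep no self-connection
                b[i] += lr * err
    w = (w + w.T) / 2                           # keep the weight matrix symmetric
    return w, b
```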

Hopfield nets with hidden units

• Instead of using the net to store memories, use it to construct interpretations of sensory input.
  – The input is represented by the visible units.
  – The interpretation is represented by the states of the hidden units.
  – The badness of the interpretation is represented by the energy.

[Figure: hidden units sitting on top of the visible units.]

3D edges from 2D images

You can only see one of these 3-D edges at a time because they occlude one another.

[Figure: a 2-D line in the picture and the family of 3-D lines that could project to it.]

Noisy networks

• A Hopfield net tries to reduce the energy at each step.
  – This makes it impossible to escape from local minima.
• We can use random noise to escape from poor minima.
  – Start with a lot of noise, so it's easy to cross energy barriers.
  – Slowly reduce the noise so that the system ends up in a deep minimum. This is "simulated annealing".

Temperature

[Figure: an energy landscape with two minima A and B separated by a barrier.]

High-temperature transition probabilities: p(A→B) = 0.2, p(B→A) = 0.1
Low-temperature transition probabilities: p(A→B) = 0.001, p(B→A) = 0.000001

Stochastic binary units

• Replace the binary threshold units by binary stochastic units that make biased random decisions.
• The "temperature" controls the amount of noise.
• Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

  p(sᵢ = 1) = 1 / (1 + e^(−ΔEᵢ/T)),  where T is the temperature
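A small sketch of the stochastic update above together with a simulated-annealing schedule (the weight/bias arrays, schedule values, and sweep counts are assumptions; the weight matrix is assumed to have a zero diagonal):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_update(w, b, s, T):
    """Pick a random unit and turn it on with probability 1 / (1 + exp(-dE_i / T))."""
    i = rng.integers(len(s))
    gap = b[i] + w[i] @ s                  # energy gap dE_i for unit i (states 0/1)
    p_on = 1.0 / (1.0 + np.exp(-gap / T))
    s[i] = 1 if rng.random() < p_on else 0
    return s

def anneal(w, b, s, temps=(10.0, 5.0, 2.0, 1.0, 0.5), sweeps=100):
    """Simulated annealing: start noisy, then cool so the net settles in a deep minimum."""
    for T in temps:
        for _ in range(sweeps * len(s)):
            s = stochastic_update(w, b, s, T)
    return s
```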

Why do we need stochastic binary units?

A. Because we cannot get rid of inherent noise.
B. Because it helps to escape local minima.
C. Because we want the system to produce randomized results.

Thermal equilibrium

• Thermal equilibrium is a difficult concept!
  – Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration.
  – The thing that settles down is the probability distribution over configurations.
  – This settles to the stationary distribution.
  – Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.
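One way to see this is to simulate many independent copies of the same noisy net and watch the fraction in each configuration: individual copies keep flipping, but the histogram stops changing. A sketch (the tiny two-unit net, temperature, and iteration counts are assumptions):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
w = np.array([[0.0, -1.0], [-1.0, 0.0]])    # assumed 2-unit net with a single weight of -1
b = np.array([0.5, 0.5])
T = 1.0

systems = rng.integers(0, 2, size=(2000, 2))     # many copies, random initial states
for _ in range(100):                             # enough updates to reach equilibrium
    for s in systems:                            # each row is one copy of the system
        i = rng.integers(2)
        gap = b[i] + w[i] @ s
        s[i] = rng.random() < 1 / (1 + np.exp(-gap / T))

# These fractions settle to the stationary distribution even though every copy keeps changing.
print(Counter(map(tuple, systems.tolist())))
```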

Modeling binary data

• Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.
• The model can be used for generating data with the same distribution as the original data.
• If a particular model (distribution) produced the observed data:

  p(modelᵢ | data) = p(data | modelᵢ) p(modelᵢ) / Σⱼ p(data | modelⱼ) p(modelⱼ)
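A hedged numerical illustration of the Bayes rule above (both candidate models and all probabilities are made up for the example):

```python
# Two hypothetical models assign different probabilities to an observed binary vector.
p_data_given = {"model_1": 0.20, "model_2": 0.05}   # p(data | model_i), assumed
p_model      = {"model_1": 0.50, "model_2": 0.50}   # p(model_i), assumed uniform prior

evidence = sum(p_data_given[m] * p_model[m] for m in p_model)
posterior = {m: p_data_given[m] * p_model[m] / evidence for m in p_model}
print(posterior)    # {'model_1': 0.8, 'model_2': 0.2}
```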

Boltzmann machine

• ...is defined in terms of the energies of joint configurations of the visible and hidden units.
• Probability of a joint configuration:

  p(v, h) ∝ e^(−E(v,h))

• This is the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.

Energy of a joint configuration

−E(v, h) = Σ_{i∈vis} vᵢbᵢ + Σ_{k∈hid} hₖbₖ + Σ_{i<j} vᵢvⱼwᵢⱼ + Σ_{i,k} vᵢhₖwᵢₖ + Σ_{k<l} hₖhₗwₖₗ

This is the energy with configuration v on the visible units and h on the hidden units, where vᵢ is the binary state of unit i in v, bₖ is the bias of unit k, wᵢₖ is the weight between visible unit i and hidden unit k, and i < j indexes every non-identical pair of i and j once.
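A direct transcription of the energy formula above into code (the array names and the convention that the within-layer weight matrices are symmetric with zero diagonals are assumptions):

```python
import numpy as np

def neg_energy(v, h, b_v, b_h, W_vv, W_vh, W_hh):
    """-E(v, h): bias terms plus visible-visible, visible-hidden and
    hidden-hidden pairwise terms, each pair counted once."""
    return (v @ b_v + h @ b_h
            + 0.5 * v @ W_vv @ v          # W_vv symmetric, zero diagonal
            + v @ W_vh @ h
            + 0.5 * h @ W_hh @ h)         # W_hh symmetric, zero diagonal
```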

From energies to probabilities

• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

  p(v, h) = e^(−E(v,h)) / Σ_{u,g} e^(−E(u,g))

  (the denominator is the partition function)

• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

  p(v) = Σ_h e^(−E(v,h)) / Σ_{u,g} e^(−E(u,g))

Example

An example of how weights define a distribution. Two visible units v1, v2 and two hidden units h1, h2, with weights w(v1,h1) = +2, w(v2,h2) = +1, w(h1,h2) = −1.

v1 v2 h1 h2 |  −E | e^−E | p(v,h) | p(v)
 1  1  1  1 |   2 | 7.39 | .186   | 0.466
 1  1  1  0 |   2 | 7.39 | .186   |
 1  1  0  1 |   1 | 2.72 | .069   |
 1  1  0  0 |   0 | 1    | .025   |
 1  0  1  1 |   1 | 2.72 | .069   | 0.305
 1  0  1  0 |   2 | 7.39 | .186   |
 1  0  0  1 |   0 | 1    | .025   |
 1  0  0  0 |   0 | 1    | .025   |
 0  1  1  1 |   0 | 1    | .025   | 0.144
 0  1  1  0 |   0 | 1    | .025   |
 0  1  0  1 |   1 | 2.72 | .069   |
 0  1  0  0 |   0 | 1    | .025   |
 0  0  1  1 |  −1 | 0.37 | .009   | 0.084
 0  0  1  0 |   0 | 1    | .025   |
 0  0  0  1 |   0 | 1    | .025   |
 0  0  0  0 |   0 | 1    | .025   |

Σ e^−E = 39.70 (the partition function)
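The table above can be reproduced by brute-force enumeration over the 16 joint configurations. A sketch using the weights shown (no biases; function and variable names are illustrative):

```python
import itertools, math

def neg_E(v1, v2, h1, h2):
    """-E(v, h) with w(v1,h1)=+2, w(v2,h2)=+1, w(h1,h2)=-1 and no biases."""
    return 2 * v1 * h1 + 1 * v2 * h2 - 1 * h1 * h2

configs = list(itertools.product([0, 1], repeat=4))
boltzmann = {c: math.exp(neg_E(*c)) for c in configs}
Z = sum(boltzmann.values())                           # partition function, ~39.70

for (v1, v2, h1, h2), weight in boltzmann.items():
    print(v1, v2, h1, h2, f"p(v,h) = {weight / Z:.3f}")

# Marginal over the hidden units, e.g. p(v = (1,1)) ~ 0.466
p_v11 = sum(weight for (v1, v2, _, _), weight in boltzmann.items() if (v1, v2) == (1, 1)) / Z
print(f"p(v = 1,1) = {p_v11:.3f}")
```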

Getting a sample from the model

• We cannot compute the normalizing term (the partition function) because it has exponentially many terms.
• So we use Markov chain Monte Carlo to get samples from the model, starting from a random global configuration:
  – Keep picking units at random and allowing them to stochastically update their states based on their energy gaps.
  – Run the Markov chain until it reaches its stationary distribution.
• The probability of a global configuration is then related to its energy by the Boltzmann distribution.

Getting a sample from the posterior distribution for a given data vector

• The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior.
  – It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector.
  – Only the hidden units are allowed to change states.
• Samples from the posterior are required for learning the weights. Each hidden configuration is an "explanation" of an observed visible configuration. Better explanations have lower energy.
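The clamped variant is the same chain with the visible units held fixed. A sketch continuing the previous one (the "visibles first, then hiddens" ordering of the state vector is an assumption of this sketch):

```python
def sample_posterior(W, b, v, n_hidden, n_steps=10000, T=1.0):
    """Sample hidden states given a data vector: clamp the visibles, update only hiddens."""
    s = np.concatenate([v, rng.integers(0, 2, size=n_hidden)])   # visibles first, then hiddens
    for _ in range(n_steps):
        i = len(v) + rng.integers(n_hidden)      # only hidden units are allowed to change
        gap = b[i] + W[i] @ s - W[i, i] * s[i]
        s[i] = rng.random() < 1 / (1 + np.exp(-gap / T))
    return s[len(v):]
```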

What does a Boltzmann machine really do?

A. Models the probability distribution of input data.
B. Generates samples from the modeled distribution.
C. Learns the probability distribution of input data from samples.
D. All of the above.
E. None of the above.
