Hopfield networks and Boltzmann machines Geoffrey Hinton et al. Presented by Tambet Matiisen 18.11.2014


Page 1: Hopfield networks and Boltzmann machines

Hopfield networks and Boltzmann machines

Geoffrey Hinton et al.
Presented by Tambet Matiisen

18.11.2014

Page 2:

Hopfield network

• Binary units
• Symmetrical connections

http://www.nnwj.de/hopfield-net.html

Page 3:

Energy function

• The global energy:

E = −Σ_i s_i b_i − Σ_{i<j} s_i s_j w_ij

• The energy gap:

ΔE_i = E(s_i = 0) − E(s_i = 1) = b_i + Σ_j s_j w_ij

• Update rule:

s_i = 1 if b_i + Σ_j s_j w_ij ≥ 0, otherwise s_i = 0

http://en.wikipedia.org/wiki/Hopfield_network
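As a concrete illustration of these formulas, here is a minimal Python sketch (the function names and the NumPy representation are illustrative, not from the slides):

```python
import numpy as np

def energy(s, W, b):
    """Global energy E = -sum_i s_i*b_i - sum_{i<j} s_i*s_j*w_ij.
    W is a symmetric weight matrix with zero diagonal; s is a 0/1 vector."""
    return -s @ b - 0.5 * s @ W @ s

def energy_gap(s, W, b, i):
    """Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + sum_j s_j*w_ij."""
    return b[i] + W[i] @ s

def update(s, W, b, i):
    """Binary threshold update: turn unit i on iff its energy gap is >= 0.
    Each such update can only lower (or keep) the global energy."""
    s[i] = 1.0 if energy_gap(s, W, b, i) >= 0 else 0.0
    return s
```

Because W has a zero diagonal, `W[i] @ s` sums s_j·w_ij over the other units j only.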

Page 4:

Example

[Figure: a small Hopfield net with connection weights 3, 2, 3, 3, −1 and −4. One configuration of the binary units has −E = goodness = 3; after updating the units marked "?", a configuration with −E = goodness = 4 is reached.]

Page 5:

Deeper energy minimum

[Figure: the same net with weights 3, 2, 3, 3, −1 and −4, settled into a different configuration with −E = goodness = 5, a deeper energy minimum.]

Page 6:

Is the updating of a Hopfield network deterministic or non-deterministic?

A. Deterministic
B. Non-deterministic

Page 7:

How to update?

• Nodes must be updated sequentially, usually in randomized order.

• With parallel updating the energy could go up.

• If updates occur in parallel but with random timing, the oscillations are usually destroyed.

[Figure: two units with biases +5 connected by a weight of −100, both starting in state 0; updated in parallel they oscillate between 00 and 11.]

Page 8:

Content-addressable memory

• Using energy minima to represent memories gives a content-addressable memory.
– An item can be accessed by just knowing part of its content.
– It can fill in missing or corrupted pieces of information.
– It is robust against hardware damage.

Page 9:

Classical conditioning

http://changecom.wordpress.com/2013/01/03/classical-conditioning/

Page 10:

Storing memories

• The energy landscape is determined by the weights!

• If we use activities of −1 and 1:

Δw_ij = s_i s_j

• If we use states 0 and 1:

Δw_ij = 4 (s_i − ½)(s_j − ½)

(in both cases w_ij is incremented when s_i = s_j and decremented when s_i ≠ s_j)
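This one-shot storage rule can be sketched as follows (a hedged illustration; `store`, `recall_step` and the outer-product formulation are illustrative, not from the slides):

```python
import numpy as np

def store(patterns):
    """One-shot Hebbian storage for patterns with activities -1 and 1:
    each pattern adds s_i*s_j to every weight w_ij (no self-connections)."""
    n = len(patterns[0])
    W = np.zeros((n, n))
    for s in patterns:
        s = np.asarray(s, dtype=float)
        W += np.outer(s, s)          # Delta w_ij = s_i * s_j
    np.fill_diagonal(W, 0.0)
    return W

def recall_step(W, s):
    """One synchronous threshold pass (biases taken as zero here),
    used to check that a stored pattern is a fixed point."""
    return np.where(W @ s >= 0, 1.0, -1.0)
```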

Page 11:

Demo

• http://www.tarkvaralabor.ee/doodler/
• (choose Algorithm: Hopfield and Initialize)

Page 12:

How many weights did the example have?

A. 100
B. 1000
C. 10000

Page 13:

Storage capacity

• The capacity of a totally connected net with N units is only about 0.15·N memories.
– With N bits per memory this is only 0.15·N² bits.

• The net has N² weights and biases.
• After storing M memories, each connection weight has an integer value in the range [−M, M].
• So the number of bits required to store the weights and biases is: N² log(2M + 1)

Page 14:

How many bits are needed to represent the weights in the example?

A. 1500
B. 50 000
C. 320 000

Page 15:

Spurious minima

• Each time we memorize a configuration, we hope to create a new energy minimum.

• But what if two minima merge to create a minimum at an intermediate location?

Page 16:

Reverse learning

• Let the net settle from a random initial state and then do unlearning.

• This will get rid of deep, spurious minima and increase memory capacity.

Page 17:

Increasing memory capacity

• Instead of trying to store vectors in one shot, cycle through the training set many times.

• Use the perceptron convergence procedure to train each unit to have the correct state given the states of all the other units in that vector.

x̂_i = f(Σ_j x_j w_ij)    Δw_ij = (x_i − x̂_i) x_j
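A hedged sketch of this idea in Python (the variable names, the 0/1 threshold prediction, and the final symmetrization step are assumptions about the details, not taken from the slides):

```python
import numpy as np

def train(patterns, n_epochs=50, lr=1.0):
    """Cycle through 0/1 training vectors; for each unit i, apply a
    perceptron-style correction so it predicts its own state x_i from
    the states of the other units."""
    patterns = np.asarray(patterns, dtype=float)
    n = patterns.shape[1]
    W = np.zeros((n, n))
    b = np.zeros(n)
    for _ in range(n_epochs):
        for x in patterns:
            for i in range(n):
                pred = 1.0 if b[i] + W[i] @ x >= 0 else 0.0
                err = x[i] - pred         # perceptron error for unit i
                b[i] += lr * err
                W[i] += lr * err * x
                W[i, i] = 0.0             # no self-connection
    return 0.5 * (W + W.T), b             # keep the net symmetric
```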

Page 18:

Hopfield nets with hidden units

• Instead of using the net to store memories, use it to construct interpretations of sensory input.
– The input is represented by the visible units.
– The interpretation is represented by the states of the hidden units.
– The badness of the interpretation is represented by the energy.

[Figure: a layer of hidden units connected to a layer of visible units]

Page 19:

3D edges from 2D images

You can only see one of these 3-D edges at a time because they occlude one another.

[Figure: a 2-D line in the picture plane, and the family of 3-D lines that could project to it]

Page 20:

Noisy networks

• A Hopfield net tries to reduce the energy at each step.
– This makes it impossible to escape from local minima.

• We can use random noise to escape from poor minima.
– Start with a lot of noise so it's easy to cross energy barriers.
– Slowly reduce the noise so that the system ends up in a deep minimum. This is "simulated annealing".

[Figure: an energy landscape with minima A, B and C]

Page 21:

Temperature

High temperature transition probabilities between minima A and B:
p(A→B) = 0.2,  p(B→A) = 0.1

Low temperature transition probabilities:
p(A→B) = 0.001,  p(B→A) = 0.000001

At low temperature the deeper minimum is much more strongly favoured, but transitions between minima become very rare.

Page 22:

Stochastic binary units

• Replace the binary threshold units by binary stochastic units that make biased random decisions.
• The "temperature" controls the amount of noise.
• Raising the noise level is equivalent to decreasing all the energy gaps between configurations.

p(s_i = 1) = 1 / (1 + e^(−ΔE_i / T)),  where T is the temperature
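The logistic rule above can be sketched directly (illustrative names):

```python
import math
import random

def p_on(energy_gap, T):
    """p(s_i = 1) = 1 / (1 + exp(-dE_i / T)); T is the temperature."""
    return 1.0 / (1.0 + math.exp(-energy_gap / T))

def sample_unit(energy_gap, T, rng=random):
    """Biased random decision: unit is on with probability p_on."""
    return 1 if rng.random() < p_on(energy_gap, T) else 0
```

As T → 0 this recovers the deterministic threshold rule; at very high T every unit is on with probability close to 0.5, regardless of its energy gap, which is what "decreasing all the energy gaps" means.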

Page 23:

Why do we need stochastic binary units?

A. Because we cannot get rid of inherent noise.
B. Because it helps to escape local minima.
C. Because we want the system to produce randomized results.

Page 24:

Thermal equilibrium

• Thermal equilibrium is a difficult concept!
– Reaching thermal equilibrium does not mean that the system has settled down into the lowest energy configuration.
– The thing that settles down is the probability distribution over configurations.
– This settles to the stationary distribution.
– Any given system keeps changing its configuration, but the fraction of systems in each configuration does not change.

Page 25:

Modeling binary data

• Given a training set of binary vectors, fit a model that will assign a probability to every possible binary vector.

• The model can be used for generating data with the same distribution as the original data.

• The probability that a particular model (distribution) produced the observed data:

p(model_i | data) = p(data | model_i) p(model_i) / Σ_j p(data | model_j) p(model_j)

Page 26:

Boltzmann machine

• ...is defined in terms of the energies of joint configurations of the visible and hidden units.

• The probability of a joint configuration:

p(v, h) ∝ e^(−E(v,h))

• This is the probability of finding the network in that joint configuration after we have updated all of the stochastic binary units many times.

Page 27:

Energy of a joint configuration

−E(v,h) = Σ_{i∈vis} v_i b_i + Σ_{k∈hid} h_k b_k + Σ_{i<j} v_i v_j w_ij + Σ_{i,k} v_i h_k w_ik + Σ_{k<l} h_k h_l w_kl

This is the energy with configuration v on the visible units and h on the hidden units. Here v_i is the binary state of unit i in v, b_k is the bias of unit k, w_ik is the weight between visible unit i and hidden unit k, and i < j indexes every non-identical pair of i and j once.

Page 28:

From energies to probabilities

• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

p(v, h) = e^(−E(v,h)) / Σ_{u,g} e^(−E(u,g))

(the denominator is the partition function)

• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

p(v) = Σ_h e^(−E(v,h)) / Σ_{u,g} e^(−E(u,g))

Page 29:

Example

An example of how weights define a distribution: two visible units v1, v2 and two hidden units h1, h2, with weights w(v1,h1) = +2, w(v2,h2) = +1, w(h1,h2) = −1 and no biases.

v1 v2 h1 h2 | −E | e^(−E) | p(v,h) | p(v)
 1  1  1  1 |  2 | 7.39   | .186   |
 1  1  1  0 |  2 | 7.39   | .186   |
 1  1  0  1 |  1 | 2.72   | .069   |
 1  1  0  0 |  0 | 1      | .025   | 0.466
 1  0  1  1 |  1 | 2.72   | .069   |
 1  0  1  0 |  2 | 7.39   | .186   |
 1  0  0  1 |  0 | 1      | .025   |
 1  0  0  0 |  0 | 1      | .025   | 0.305
 0  1  1  1 |  0 | 1      | .025   |
 0  1  1  0 |  0 | 1      | .025   |
 0  1  0  1 |  1 | 2.72   | .069   |
 0  1  0  0 |  0 | 1      | .025   | 0.144
 0  0  1  1 | −1 | 0.37   | .009   |
 0  0  1  0 |  0 | 1      | .025   |
 0  0  0  1 |  0 | 1      | .025   |
 0  0  0  0 |  0 | 1      | .025   | 0.084

Σ e^(−E) = 39.70
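The table can be reproduced by enumerating all 16 joint configurations (a small Python check; the helper names are illustrative):

```python
import itertools
import math

# Weights from the slide: w(v1,h1) = +2, w(v2,h2) = +1, w(h1,h2) = -1, no biases.
def neg_energy(v1, v2, h1, h2):
    return 2 * v1 * h1 + 1 * v2 * h2 - 1 * h1 * h2

configs = list(itertools.product([0, 1], repeat=4))
Z = sum(math.exp(neg_energy(*c)) for c in configs)    # partition function, ~39.70

def p_joint(c):
    return math.exp(neg_energy(*c)) / Z

def p_visible(v1, v2):
    # Sum over all joint configurations that contain (v1, v2).
    return sum(p_joint((v1, v2, h1, h2)) for h1 in (0, 1) for h2 in (0, 1))
```

`p_visible(1, 1)` comes out at about 0.466, matching the table's p(v) column.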

Page 30:

Getting a sample from the model

• We cannot compute the normalizing term (the partition function) because it has exponentially many terms.

• So we use Markov chain Monte Carlo to get samples from the model, starting from a random global configuration:
– Keep picking units at random and allowing them to stochastically update their states based on their energy gaps.
– Run the Markov chain until it reaches its stationary distribution.

• The probability of a global configuration is then related to its energy by the Boltzmann distribution.
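The sampling procedure described above, sketched as a single Gibbs-style Markov chain at temperature T = 1 (illustrative names; W is assumed symmetric with zero diagonal):

```python
import math
import random

def gibbs_sample(W, b, n_steps, rng=random):
    """Start from a random global configuration, then repeatedly pick a
    unit at random and resample it stochastically from its energy gap."""
    n = len(b)
    s = [rng.randint(0, 1) for _ in range(n)]      # random global config
    for _ in range(n_steps):
        i = rng.randrange(n)
        gap = b[i] + sum(W[i][j] * s[j] for j in range(n))
        s[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-gap)) else 0
    return s   # after many steps, approximately a Boltzmann-distributed sample
```

With a strong positive weight between two units, samples drawn this way mostly have the two units in agreement, as the Boltzmann distribution predicts.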

Page 31:

Getting a sample from the posterior distribution for a given data vector

• The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior.
– It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector.
– Only the hidden units are allowed to change states.

• Samples from the posterior are required for learning the weights. Each hidden configuration is an "explanation" of an observed visible configuration. Better explanations have lower energy.

Page 32:

What does a Boltzmann machine really do?

A. Models the probability distribution of input data.
B. Generates samples from the modeled distribution.
C. Learns the probability distribution of input data from samples.
D. All of the above.
E. None of the above.