Can computer simulations of the brain allow us to see into the mind?
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto
Overview
• Some old theories of how cortex learns and why they fail.
• Causal generative models and how to learn them.
• Energy-based generative models and how to learn them.
– An example: modeling a class of highly variable shapes by using a set of learned features.
• A fast learning algorithm for deep networks that have many layers of neurons.
– A really good generative model of handwritten digits.
– How to see into the network’s mind.
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights on hundreds of thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn’t have enough capacity to store the answer.
• So evolution has found a learning algorithm and provided the right hardware environment for it to work in.
– Searching the space of learning algorithms is a much better bet than searching for weights directly.
A very simple learning task
• Consider a neural network with two layers of neurons.
– Neurons in the top layer represent known shapes.
– Neurons in the bottom layer represent pixel intensities.
• A pixel gets to vote if it has ink on it.
– Each inked pixel can vote for several different shapes.
• The shape that gets the most votes wins.

[Figure: class units 0–9 above a layer of pixel units]
How to learn the weights (1960’s)
Show the network an image and increment the weights from active pixels to the correct class.
Then decrement the weights from active pixels to whatever class the network guesses.
[Figure: the input image below class units 1–9, 0; weights grow from active pixels to the correct class and shrink to the guessed class]
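The 1960s voting-and-update rule described above can be sketched as follows. This is a minimal illustration with made-up sizes; the function and variable names are mine, not from the slides.

```python
import numpy as np

def train_step(W, image, label):
    """One update of the 1960s rule.
    W: (n_classes, n_pixels) weight matrix; image: binary pixel vector; label: int."""
    votes = W @ image                # each inked pixel votes for every class
    guess = int(np.argmax(votes))    # the shape with the most votes wins
    W[label] += image                # increment weights to the correct class
    W[guess] -= image                # decrement weights to the guessed class
    return guess

# Toy usage: 10 classes, 4-pixel "images"
W = np.zeros((10, 4))
img = np.array([1.0, 0.0, 1.0, 0.0])
train_step(W, img, label=3)
```

If the guess is already correct, the increment and decrement cancel and the weights stay put, which is why learning stops once every training image is classified correctly.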
The learned weights
[Figure: the learned weights for each class, displayed as template images]
Why the simple system does not work
• A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
– The winner is the template that has the biggest overlap with the ink.
• The ways in which shapes vary are much too complicated to be captured by simple template matches of whole shapes.
– To capture all the allowable variations of a shape we need to learn the features that it is composed of.
Good Old-Fashioned Neural Networks (1980’s)
• The network is given an input vector and it must produce an output that represents:
– a classification (e.g. the identity of a face)
– or a prediction (e.g. the price of oil tomorrow)
• The network is made of multiple layers of non-linear neurons.
– Each neuron sums its weighted inputs from the layer below and non-linearly transforms this sum into an output that is sent to the layer above.
• The weights are learned by looking at a big set of labeled training examples.
Good old-fashioned neural networks

[Figure: input vector → hidden layers → outputs. Compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning.]
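The forward-then-backward scheme in the figure can be sketched in a few lines. This is a generic two-layer backprop step with illustrative sizes and a squared-error loss; all names are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (3, 4))
W2 = rng.normal(0, 0.1, (2, 3))

def backprop_step(x, target, lr=0.1):
    global W1, W2
    h = sigmoid(W1 @ x)              # forward pass: signal = activity = y
    y = sigmoid(W2 @ h)
    err = y - target                 # compare outputs with the correct answer
    dy = err * y * (1 - y)           # backward pass: signal = dE/dy
    dh = (W2.T @ dy) * h * (1 - h)   # propagate the error signal down a layer
    W2 -= lr * np.outer(dy, h)
    W1 -= lr * np.outer(dh, x)
    return float(0.5 * np.sum(err ** 2))

x = np.array([1.0, 0.0, 1.0, 0.5])
t = np.array([1.0, 0.0])
losses = [backprop_step(x, t) for _ in range(200)]
```

The two kinds of signal per neuron (activity on the way up, dE/dy on the way down) are exactly the biological implausibility the next slide complains about.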
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• We need to fit about 10^14 connection weights in only about 10^9 seconds.
– Unless the weights are highly redundant, labels cannot possibly provide enough information.
• The learning time does not scale well.
– It is very slow in networks with more than two or three hidden layers.
• The neurons need to send two different types of signal.
– Forward pass: signal = activity = y
– Backward pass: signal = dE/dy
Overcoming the limitations of back-propagation
• We need to keep the efficiency of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
– Adjust the weights to maximize the probability that a generative model would have produced the sensory input. This is the only place to get 10^5 bits per second.
– Learn p(image), not p(label | image).
• What kind of generative model could the brain be using?
The building blocks: Binary stochastic neurons
• y_j is the probability that neuron j produces a spike.

$$x_j = \text{external input} + \sum_i y_i w_{ij}$$
$$y_j = \frac{1}{1 + e^{-x_j}}$$

where $x_j$ is the total input to neuron $j$, $y_i$ is the output of neuron $i$, and $w_{ij}$ is the synaptic weight from $i$ to $j$. (The logistic rises from 0 through 0.5 at $x_j = 0$ towards 1.)
Sigmoid Belief Nets
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.

[Figure: hidden causes connected top-down to visible effects]
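The "easy to generate" direction is just ancestral sampling: sample the hidden causes from their biases, then sample the visible effects given the causes. A minimal sketch with made-up sizes and weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical two-layer net: 3 hidden causes, 5 visible effects.
b_h = np.zeros(3)                    # biases of the hidden causes
W = rng.normal(0, 1.0, (5, 3))      # top-down generative weights (illustrative)
b_v = np.zeros(5)                    # biases of the visible effects

def generate():
    """Ancestral sampling: sample causes, then effects given the causes."""
    h = (rng.random(3) < sigmoid(b_h)).astype(float)
    v = (rng.random(5) < sigmoid(b_v + W @ h)).astype(float)
    return h, v

h, v = generate()   # an unbiased sample from the model's distribution
```

Going the other way (inferring h from v) is the hard part, because the causes become dependent given the effects, as the next slide on explaining away illustrates.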
Explaining away
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: “truck hits house” and “earthquake” each have bias −10 and a weight of +20 to “house jumps”, which has bias −20]
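With the numbers from the diagram, the posterior can be computed exactly by enumerating the four cause configurations. A short sketch (pure Python, variable names mine):

```python
from itertools import product
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# Biases of -10 on each cause, +20 weights to the effect, bias -20 on the effect.
b_truck, b_quake, b_jump, w = -10.0, -10.0, -20.0, 20.0

# Posterior over (truck, earthquake) given that the house jumped.
joint = {}
for t, q in product([0, 1], repeat=2):
    prior = (sigmoid(b_truck) ** t) * ((1 - sigmoid(b_truck)) ** (1 - t)) \
          * (sigmoid(b_quake) ** q) * ((1 - sigmoid(b_quake)) ** (1 - q))
    likelihood = sigmoid(b_jump + w * t + w * q)   # p(house jumps | causes)
    joint[(t, q)] = prior * likelihood
Z = sum(joint.values())
posterior = {k: v / Z for k, v in joint.items()}
# Nearly all posterior mass falls on (1,0) and (0,1): confirming one cause
# "explains away" the other, even though the priors are independent.
```

The posterior puts roughly half its mass on each single-cause configuration and almost none on both-causes or neither, which is exactly the dependency that makes exact inference hard in larger nets.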
The wake-sleep algorithm
• Wake phase: Use the recognition weights to perform a bottom-up pass.
– Train the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use the generative weights to generate samples from the model.
– Train the recognition weights to reconstruct activities in each layer from the layer below.

[Figure: layers data, h1, h2, h3 linked by generative weights W1, W2, W3 and recognition weights R1, R2, R3]
How good is the wake-sleep algorithm?
• It solves the problem of where to get target values for learning.
– The wake phase provides targets for learning the generative connections.
– The sleep phase provides targets for learning the recognition connections (because the network knows how the fantasy data was generated).
• It only requires neurons to send one kind of signal.
• It approximates the true posterior by assuming independence.
– This ignores explaining away, which causes problems.
Two types of generative neural network
• If we connect binary stochastic neurons in a directed acyclic graph we get Sigmoid Belief Nets (Neal 1992).
• If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
How a Boltzmann Machine models data
• It is not a causal generative model (like a sigmoid belief net) in which we first generate the hidden states and then generate the visible states given the hidden ones.
• To generate a sample from the model, we just keep stochastically updating the binary states of all the units.
– After a while, the probability of observing any particular vector on the visible units will have reached its equilibrium value.

[Figure: hidden units symmetrically connected to visible units]
Restricted Boltzmann Machines
• We restrict the connectivity to make learning easier.
– Only one layer of hidden units.
– No connections between hidden units.
• In an RBM, the hidden units really are conditionally independent given the visible states. It only takes one step to reach the conditional equilibrium distribution when the visible units are clamped.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.

[Figure: a layer of hidden units j above a layer of visible units i, with no within-layer connections]

Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
– The energy is determined by the weights and biases.
• The energy of a joint configuration of the visible and hidden units determines its probability.
• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
The Energy of a joint configuration
$$E(v,h) = -\sum_i b_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}$$

where $v_i$ is the binary state of visible unit $i$, $h_j$ is the binary state of hidden unit $j$, $b_i$ and $b_j$ are the biases of units $i$ and $j$, $w_{ij}$ is the weight between units $i$ and $j$, and the sum over $i,j$ indexes every connected visible–hidden pair.
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The denominator $\sum_{u,g} e^{-E(u,g)}$ is the partition function.
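For an RBM small enough to enumerate, the two formulas above can be computed directly. A sketch with arbitrary illustrative weights (2 visible, 2 hidden units):

```python
import numpy as np
from itertools import product

# A tiny RBM, small enough to enumerate the partition function exactly.
# These weights and biases are arbitrary illustrative values, not learned ones.
W = np.array([[1.0, -1.0],
              [0.5,  2.0]])          # w_ij between visible unit i and hidden unit j
b_v = np.array([0.0, -0.5])          # biases of the visible units
b_h = np.array([0.5,  0.0])          # biases of the hidden units

def energy(v, h):
    # E(v,h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h)) for v in configs for h in configs)  # partition fn

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    # sum the probabilities of all joint configurations that contain v
    return sum(p_joint(v, h) for h in configs)

total = sum(p_visible(v) for v in configs)   # the p(v) values sum to 1
```

In a real model the partition function has 2^(units) terms and cannot be enumerated, which is why the learning algorithms below avoid computing it.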
A picture of the maximum likelihood learning algorithm for an RBM
[Figure: a Gibbs sampling chain of visible units i and hidden units j at t = 0, 1, 2, …, ∞, starting at the data and ending at a fantasy]

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. $\langle v_i h_j \rangle^0$ is measured at $t = 0$ and $\langle v_i h_j \rangle^\infty$ at equilibrium, where the chain produces a fantasy.
Contrastive divergence learning: A quick way to learn an RBM
[Figure: the same chain, truncated at t = 1 with data on the left and the reconstruction on the right]

$$\Delta w_{ij} = \varepsilon\,(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1)$$

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a “reconstruction”. Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function.
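The four steps of contrastive divergence can be sketched in NumPy. Biases are omitted for brevity and the sizes are illustrative; this is a sketch of CD-1, not the full recipe used in the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def cd1_update(W, v0, lr=0.1):
    """One contrastive-divergence (CD-1) step for an RBM, biases omitted."""
    ph0 = sigmoid(v0 @ W)                              # t = 0: update hiddens given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                            # t = 1: get a "reconstruction"
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                              # update the hiddens again
    # delta w_ij = lr * (<v_i h_j>^0 - <v_i h_j>^1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

W = rng.normal(0, 0.01, (6, 4))       # 6 visible, 4 hidden (illustrative sizes)
data = np.array([1., 1., 1., 0., 0., 0.])

def recon_error(W):
    """Deterministic mean-field reconstruction error, for monitoring only."""
    return float(np.sum((sigmoid(sigmoid(data @ W) @ W.T) - data) ** 2))

before = recon_error(W)
for _ in range(200):
    cd1_update(W, data)
after = recon_error(W)
```

After a few hundred updates on the single training vector the reconstruction error drops well below its initial value, mirroring the digit-2 feature-learning demo that follows.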
How to learn a set of features that are good for reconstructing images of the digit 2
[Figure: 50 binary feature neurons connected to a 16 × 16 pixel image]

Increment weights between an active pixel and an active feature when driven by the data (reality). Decrement weights between an active pixel and an active feature when driven by the reconstruction (which has lower energy than reality).

[Figure: pairs of images — data and its reconstruction from the activated binary features]
How well can we reconstruct the digit images from the binary feature activations?
New test images from the digit class that the model was trained on
Images from an unfamiliar digit class (the network tries to see every image as a 2)
Training a deep network
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we get a better model of the set of training images.
– i.e. we assign lower energy to the real data and higher energy to all other possible images.
– The proof is complicated. It uses variational free energy, a method that physicists use for analyzing complicated non-equilibrium systems.
– But it is based on a neat equivalence.
A causal network that is equivalent to an RBM
• The variables in h1 are exactly conditionally independent given v.
– Inference is trivial. We just multiply v by $W^T$.
– This is because the explaining away is eliminated by the layers above h1.
• Inference in the causal network is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

[Figure: an infinite stack v, h1, h2, h3, … with tied weights, alternating W and $W^T$ between layers]
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.

[Figure: the RBM v–h with weights W, unrolled into the infinite tied-weight stack v, h1, h2, h3, …]
Learning a deep causal network
• First learn with all the weights tied.

[Figure: the infinite stack v, h1, h2, h3, … with every weight matrix equal to W1]
• Then freeze the bottom layer and relearn all the other layers.

[Figure: W1 frozen between v and h1; the layers above all share W2]
• Then freeze the bottom two layers and relearn all the other layers.

[Figure: W1 and W2 frozen; the layers above all share W3]
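The freeze-and-relearn recipe above amounts to greedy layer-wise pretraining: train one RBM, freeze it, feed its feature activations to the next RBM as if they were pixels. A sketch with illustrative sizes (CD-1 inside, biases omitted; all names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """Train one RBM with CD-1 and return its weight matrix (biases omitted)."""
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            v1 = (rng.random(data.shape[1]) < sigmoid(h0 @ W.T)).astype(float)
            ph1 = sigmoid(v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W

def greedy_pretrain(data, layer_sizes):
    """Freeze each trained layer, then treat its feature activations
    as the 'pixels' for the next layer."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)           # activations become the next layer's data
    return weights

data = rng.integers(0, 2, (20, 8)).astype(float)
stack = greedy_pretrain(data, [6, 4])   # a 2-layer stack on random toy data
```

Each RBM in the stack only ever sees the layer below it, which is what makes the procedure greedy and fast compared with training the whole deep network jointly.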
The generative model after learning 3 layers
• To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
2. Perform a top-down pass to get states for all the other layers.
• So the lower-level bottom-up connections are not part of the generative model.

[Figure: data, h1, h2, h3 with generative weights W1, W2, W3; the top two layers form the RBM]
A neural model of digit recognition
[Figure: 28 × 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons, which also connect to 10 label neurons]

The model learns to generate combinations of labels and images. To perform recognition we start with a neutral state of the label units and do an up-pass from the image, followed by one or two iterations of the top-level associative memory. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
Fine-tuning with the up-down algorithm: A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM.
– This eliminates explaining away at the top level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in the vicinity of the data.
– It also reduces mode averaging. If the recognition weights prefer one mode, they will stick with that mode even if the generative weights like some other mode just as much.
Examples of correctly recognized handwritten digits that the neural network had never seen before

It’s very good.
How well does it discriminate on MNIST test set with no extra information about geometric distortions?
• Up-down net with RBM pre-training + CD10: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): 1.5%
• Backprop with 500 → 300 hiddens: 1.5%
• K-Nearest Neighbor: ~3.3%
• It’s better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal and the teacher can be another sensory input.
Seeing what it is thinking
• The top-level associative memory has activities over thousands of neurons.
– It is hard to tell what the network is thinking by looking at the patterns of activation.
• To see what it is thinking, convert the top-level representation into an image by using the top-down connections.
– A mental state is the state of a hypothetical world in which the internal representation is correct.
[Figure: brain scan showing the extra activation of cortex caused by a speech task. What were they thinking?]
What goes on in its mind if we show it an image composed of random pixels and ask it to fantasize from there?

[Figure: alternating “mind” and “brain” states during fantasies, shown on the network of 28 × 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons with 10 label neurons]
Samples generated by running the associative memory with one label clamped. Initialized by an up-pass from a random binary image. 20 iterations between samples.
Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.
[Figure: the same architecture — 28 × 28 pixel image → 500 units → 500 units → 2000 top-level units, with 10 label units]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous digit between “six” and “five” — the original visual input flanked by its reconstruction under each auditory label]
And now for something a bit more realistic
• Handwritten digits are convenient for research into shape recognition, but natural images of outdoor scenes are much more complicated.
– If we train a network on patches from natural images, does it produce sets of features that look like the ones found in real brains?
A network with local connectivity
[Figure: an image layer with global connectivity to the first hidden layer and local connectivity between the two hidden layers]

The local connectivity between the two hidden layers induces a topography on the hidden units.
Features learned by a net that sees 100,000 patches of natural images.
The feature neurons are locally connected to each other.
All 125 errors. The best other machine learning technique gets 140 errors on this task. (These results are without added distortions or geometric prior knowledge.)
A simple learning algorithm
[Figure: data i–j and reconstruction i–j]

$$\Delta w_{ij} = \varepsilon\,(\langle s_i s_j \rangle^{\text{data}} - \langle s_i s_j \rangle^{\text{recon}})$$

where $\varepsilon$ is the learning rate. Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. “Visible” neurons represent pixels; “hidden” neurons represent features.
Types of connectivity
• Feed-forward networks
– These compute a series of transformations.
– Typically, the first layer is the input and the last layer is the output.
• Recurrent networks
– These have directed cycles in their connection graph. They can have complicated dynamics.
– More biologically realistic.

[Figure: input units → hidden units → output units]
A simple learning algorithm
[Figure: data i–j and reconstruction i–j]

1. Start with a training image on the visible neurons.
2. Pick binary states for the hidden neurons. Increment weights between active pairs.
3. Then pick binary states for the visible neurons.
4. Then pick binary states for the hidden neurons again. Decrement weights between active pairs.

This changes the weights to increase the goodness of the data and decrease the goodness of the reconstruction. “Visible” neurons represent pixels; “hidden” neurons represent features.
A more interesting type of network
• Several politicians can be elected at the same time.
– A politician corresponds to a feature, and each familiar object corresponds to a coalition of features.
• The politicians decide who can vote (like in Florida).
• The whole system of voters and politicians can have several different stable states.
– Conservative politicians discourage liberal voters from voting.
– Liberal politicians discourage conservative voters from voting.
• If we add some noise to the voting process, the system will occasionally jump from one regime to another.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron’s bias, b, and the input it receives from other neurons.
$$p(s_i = 1) = \frac{1}{1 + \exp\!\big({-b_i - \sum_j s_j w_{ji}}\big)}$$

(The logistic rises from 0 through 0.5 towards 1 as the total input increases.)
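The formula above translates directly into code. A minimal sketch with made-up states and weights for three input neurons:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def neuron_state(b_i, s, w_i):
    """p(s_i = 1) = 1 / (1 + exp(-b_i - sum_j s_j w_ji)); sample the binary state."""
    p = sigmoid(b_i + s @ w_i)
    return int(rng.random() < p), p

s = np.array([1.0, 0.0, 1.0])        # states of three other neurons (illustrative)
w = np.array([2.0, -1.0, 0.5])       # weights from those neurons (illustrative)
state, p = neuron_state(-1.0, s, w)  # total input: -1 + 2 + 0.5 = 1.5
```

Because the state is a sample rather than the probability itself, repeatedly updating a network of such neurons yields the stochastic dynamics used for Gibbs sampling earlier in the talk.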
How the brain works
• Each neuron receives inputs from thousands of other neurons.
– A few neurons also get inputs from the sensory receptors.
– A few neurons send outputs to muscles.
– Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
– The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
– Recognizing objects, understanding language, making plans, controlling the body.