Can computer simulations of the brain allow us to see into the mind?
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto
Overview
• Some old theories of how cortex learns and why they fail.
• Causal generative models and how to learn them.
• Energy-based generative models and how to learn them.
– An example: modeling a class of highly variable shapes by using a set of learned features.
• A fast learning algorithm for deep networks that have many layers of neurons.
– A really good generative model of handwritten digits.
– How to see into the network’s mind.
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights on hundreds of thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn’t have enough capacity to store the answer.
• So evolution has found a learning algorithm and provided the right hardware environment for it to work in.
– Searching the space of learning algorithms is a much better bet than searching for weights directly.
A very simple learning task
• Consider a neural network with two layers of neurons.
– Neurons in the top layer represent known shapes.
– Neurons in the bottom layer represent pixel intensities.
• A pixel gets to vote if it has ink on it.
– Each inked pixel can vote for several different shapes.
• The shape that gets the most votes wins.

[Figure: class units 0–9 above a layer of pixel units]
How to learn the weights (1960’s)
Show the network an image and increment the weights from active pixels to the correct class.
Then decrement the weights from active pixels to whatever class the network guesses.
[Figure: the input image below class units 1–9, 0; weights grow from active pixels to the correct class and shrink to the guessed class]
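The 1960s voting-and-update rule described above can be sketched as follows. This is a minimal illustration with made-up sizes; the function and variable names are mine, not from the slides.

```python
import numpy as np

def train_step(W, image, label):
    """One update of the 1960s rule.
    W: (n_classes, n_pixels) weight matrix; image: binary pixel vector; label: int."""
    votes = W @ image                # each inked pixel votes for every class
    guess = int(np.argmax(votes))    # the shape with the most votes wins
    W[label] += image                # increment weights to the correct class
    W[guess] -= image                # decrement weights to the guessed class
    return guess

# Toy usage: 10 classes, 4-pixel "images"
W = np.zeros((10, 4))
img = np.array([1.0, 0.0, 1.0, 0.0])
train_step(W, img, label=3)
```

If the guess is already correct, the increment and decrement cancel and the weights stay put, which is why learning stops once every training image is classified correctly.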
The learned weights
[Figure: the learned weights for each class, displayed as template images]
Why the simple system does not work
• A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
– The winner is the template that has the biggest overlap with the ink.
• The ways in which shapes vary are much too complicated to be captured by simple template matches of whole shapes.
– To capture all the allowable variations of a shape we need to learn the features that it is composed of.
Good Old-Fashioned Neural Networks (1980’s)
• The network is given an input vector and it must produce an output that represents:
– a classification (e.g. the identity of a face)
– or a prediction (e.g. the price of oil tomorrow)
• The network is made of multiple layers of non-linear neurons.
– Each neuron sums its weighted inputs from the layer below and non-linearly transforms this sum into an output that is sent to the layer above.
• The weights are learned by looking at a big set of labeled training examples.
Good old-fashioned neural networks

[Figure: input vector → hidden layers → outputs. Compare the outputs with the correct answer to get an error signal, then back-propagate the error signal to get derivatives for learning.]
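The forward-then-backward scheme in the figure can be sketched in a few lines. This is a generic two-layer backprop step with illustrative sizes and a squared-error loss; all names are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical shapes: 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (3, 4))
W2 = rng.normal(0, 0.1, (2, 3))

def backprop_step(x, target, lr=0.1):
    global W1, W2
    h = sigmoid(W1 @ x)              # forward pass: signal = activity = y
    y = sigmoid(W2 @ h)
    err = y - target                 # compare outputs with the correct answer
    dy = err * y * (1 - y)           # backward pass: signal = dE/dy
    dh = (W2.T @ dy) * h * (1 - h)   # propagate the error signal down a layer
    W2 -= lr * np.outer(dy, h)
    W1 -= lr * np.outer(dh, x)
    return float(0.5 * np.sum(err ** 2))

x = np.array([1.0, 0.0, 1.0, 0.5])
t = np.array([1.0, 0.0])
losses = [backprop_step(x, t) for _ in range(200)]
```

The two kinds of signal per neuron (activity on the way up, dE/dy on the way down) are exactly the biological implausibility the next slide complains about.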
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• We need to fit about 10^14 connection weights in only about 10^9 seconds.
– Unless the weights are highly redundant, labels cannot possibly provide enough information.
• The learning time does not scale well.
– It is very slow in networks with more than two or three hidden layers.
• The neurons need to send two different types of signal.
– Forward pass: signal = activity = y
– Backward pass: signal = dE/dy
Overcoming the limitations of back-propagation
• We need to keep the efficiency of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
– Adjust the weights to maximize the probability that a generative model would have produced the sensory input. This is the only place to get 10^5 bits per second.
– Learn p(image), not p(label | image).
• What kind of generative model could the brain be using?
The building blocks: Binary stochastic neurons
• y_j is the probability that neuron j produces a spike.

$$x_j = \text{external input} + \sum_i y_i w_{ij}$$
$$y_j = \frac{1}{1 + e^{-x_j}}$$

where $x_j$ is the total input to neuron $j$, $y_i$ is the output of neuron $i$, and $w_{ij}$ is the synaptic weight from $i$ to $j$. (The logistic rises from 0 through 0.5 at $x_j = 0$ towards 1.)
Sigmoid Belief Nets
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.

[Figure: hidden causes connected top-down to visible effects]
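The "easy to generate" direction is just ancestral sampling: sample the hidden causes from their biases, then sample the visible effects given the causes. A minimal sketch with made-up sizes and weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Hypothetical two-layer net: 3 hidden causes, 5 visible effects.
b_h = np.zeros(3)                    # biases of the hidden causes
W = rng.normal(0, 1.0, (5, 3))      # top-down generative weights (illustrative)
b_v = np.zeros(5)                    # biases of the visible effects

def generate():
    """Ancestral sampling: sample causes, then effects given the causes."""
    h = (rng.random(3) < sigmoid(b_h)).astype(float)
    v = (rng.random(5) < sigmoid(b_v + W @ h)).astype(float)
    return h, v

h, v = generate()   # an unbiased sample from the model's distribution
```

Going the other way (inferring h from v) is the hard part, because the causes become dependent given the effects, as the next slide on explaining away illustrates.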
Explaining away
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.

[Figure: “truck hits house” and “earthquake” each have bias −10 and a weight of +20 to “house jumps”, which has bias −20]
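With the numbers from the diagram, the posterior can be computed exactly by enumerating the four cause configurations. A short sketch (pure Python, variable names mine):

```python
from itertools import product
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# Biases of -10 on each cause, +20 weights to the effect, bias -20 on the effect.
b_truck, b_quake, b_jump, w = -10.0, -10.0, -20.0, 20.0

# Posterior over (truck, earthquake) given that the house jumped.
joint = {}
for t, q in product([0, 1], repeat=2):
    prior = (sigmoid(b_truck) ** t) * ((1 - sigmoid(b_truck)) ** (1 - t)) \
          * (sigmoid(b_quake) ** q) * ((1 - sigmoid(b_quake)) ** (1 - q))
    likelihood = sigmoid(b_jump + w * t + w * q)   # p(house jumps | causes)
    joint[(t, q)] = prior * likelihood
Z = sum(joint.values())
posterior = {k: v / Z for k, v in joint.items()}
# Nearly all posterior mass falls on (1,0) and (0,1): confirming one cause
# "explains away" the other, even though the priors are independent.
```

The posterior puts roughly half its mass on each single-cause configuration and almost none on both-causes or neither, which is exactly the dependency that makes exact inference hard in larger nets.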
The wake-sleep algorithm
• Wake phase: Use the recognition weights to perform a bottom-up pass.
– Train the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use the generative weights to generate samples from the model.
– Train the recognition weights to reconstruct activities in each layer from the layer below.

[Figure: layers data, h1, h2, h3 linked by generative weights W1, W2, W3 and recognition weights R1, R2, R3]
How good is the wake-sleep algorithm?
• It solves the problem of where to get target values for learning.
– The wake phase provides targets for learning the generative connections.
– The sleep phase provides targets for learning the recognition connections (because the network knows how the fantasy data was generated).
• It only requires neurons to send one kind of signal.
• It approximates the true posterior by assuming independence.
– This ignores explaining away, which causes problems.
Two types of generative neural network
• If we connect binary stochastic neurons in a directed acyclic graph we get Sigmoid Belief Nets (Neal 1992).
• If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
How a Boltzmann Machine models data
• It is not a causal generative model (like a sigmoid belief net) in which we first generate the hidden states and then generate the visible states given the hidden ones.
• To generate a sample from the model, we just keep stochastically updating the binary states of all the units.
– After a while, the probability of observing any particular vector on the visible units will have reached its equilibrium value.

[Figure: hidden units symmetrically connected to visible units]
Restricted Boltzmann Machines
• We restrict the connectivity to make learning easier.
– Only one layer of hidden units.
– No connections between hidden units.
• In an RBM, the hidden units really are conditionally independent given the visible states. It only takes one step to reach the conditional equilibrium distribution when the visible units are clamped.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.

[Figure: a layer of hidden units j above a layer of visible units i, with no within-layer connections]

Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
– The energy is determined by the weights and biases.
• The energy of a joint configuration of the visible and hidden units determines its probability.
• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
The Energy of a joint configuration
$$E(v,h) = -\sum_i b_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}$$

where $v_i$ is the binary state of visible unit $i$, $h_j$ is the binary state of hidden unit $j$, $b_i$ and $b_j$ are the biases of units $i$ and $j$, $w_{ij}$ is the weight between units $i$ and $j$, and the sum over $i,j$ indexes every connected visible–hidden pair.
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The denominator $\sum_{u,g} e^{-E(u,g)}$ is the partition function.
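For an RBM small enough to enumerate, the two formulas above can be computed directly. A sketch with arbitrary illustrative weights (2 visible, 2 hidden units):

```python
import numpy as np
from itertools import product

# A tiny RBM, small enough to enumerate the partition function exactly.
# These weights and biases are arbitrary illustrative values, not learned ones.
W = np.array([[1.0, -1.0],
              [0.5,  2.0]])          # w_ij between visible unit i and hidden unit j
b_v = np.array([0.0, -0.5])          # biases of the visible units
b_h = np.array([0.5,  0.0])          # biases of the hidden units

def energy(v, h):
    # E(v,h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -(b_v @ v) - (b_h @ h) - (v @ W @ h)

configs = [np.array(c, dtype=float) for c in product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h)) for v in configs for h in configs)  # partition fn

def p_joint(v, h):
    return np.exp(-energy(v, h)) / Z

def p_visible(v):
    # sum the probabilities of all joint configurations that contain v
    return sum(p_joint(v, h) for h in configs)

total = sum(p_visible(v) for v in configs)   # the p(v) values sum to 1
```

In a real model the partition function has 2^(units) terms and cannot be enumerated, which is why the learning algorithms below avoid computing it.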
A picture of the maximum likelihood learning algorithm for an RBM
[Figure: a Gibbs sampling chain of visible units i and hidden units j at t = 0, 1, 2, …, ∞, starting at the data and ending at a fantasy]

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. $\langle v_i h_j \rangle^0$ is measured at $t = 0$ and $\langle v_i h_j \rangle^\infty$ at equilibrium, where the chain produces a fantasy.
Contrastive divergence learning: A quick way to learn an RBM
[Figure: the same chain, truncated at t = 1 with data on the left and the reconstruction on the right]

$$\Delta w_{ij} = \varepsilon\,(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1)$$

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a “reconstruction”. Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function.
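The four steps of contrastive divergence can be sketched in NumPy. Biases are omitted for brevity and the sizes are illustrative; this is a sketch of CD-1, not the full recipe used in the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def cd1_update(W, v0, lr=0.1):
    """One contrastive-divergence (CD-1) step for an RBM, biases omitted."""
    ph0 = sigmoid(v0 @ W)                              # t = 0: update hiddens given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                            # t = 1: get a "reconstruction"
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                              # update the hiddens again
    # delta w_ij = lr * (<v_i h_j>^0 - <v_i h_j>^1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

W = rng.normal(0, 0.01, (6, 4))       # 6 visible, 4 hidden (illustrative sizes)
data = np.array([1., 1., 1., 0., 0., 0.])

def recon_error(W):
    """Deterministic mean-field reconstruction error, for monitoring only."""
    return float(np.sum((sigmoid(sigmoid(data @ W) @ W.T) - data) ** 2))

before = recon_error(W)
for _ in range(200):
    cd1_update(W, data)
after = recon_error(W)
```

After a few hundred updates on the single training vector the reconstruction error drops well below its initial value, mirroring the digit-2 feature-learning demo that follows.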
How to learn a set of features that are good for reconstructing images of the digit 2
[Figure: 50 binary feature neurons connected to a 16 × 16 pixel image]

Increment weights between an active pixel and an active feature when driven by the data (reality). Decrement weights between an active pixel and an active feature when driven by the reconstruction (which has lower energy than reality).

[Figure: pairs of images — data and its reconstruction from the activated binary features]
How well can we reconstruct the digit images from the binary feature activations?
New test images from the digit class that the model was trained on
Images from an unfamiliar digit class (the network tries to see every image as a 2)
Training a deep network
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we get a better model of the set of training images.
– i.e. we assign lower energy to the real data and higher energy to all other possible images.
– The proof is complicated. It uses variational free energy, a method that physicists use for analyzing complicated non-equilibrium systems.
– But it is based on a neat equivalence.
A causal network that is equivalent to an RBM
• The variables in h1 are exactly conditionally independent given v.
– Inference is trivial. We just multiply v by $W^T$.
– This is because the explaining away is eliminated by the layers above h1.
• Inference in the causal network is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

[Figure: an infinite stack v, h1, h2, h3, … with tied weights, alternating W and $W^T$ between layers]
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.

[Figure: the RBM v–h with weights W, unrolled into the infinite tied-weight stack v, h1, h2, h3, …]
Learning a deep causal network
• First learn with all the weights tied.

[Figure: the infinite stack v, h1, h2, h3, … with every weight matrix equal to W1]
• Then freeze the bottom layer and relearn all the other layers.

[Figure: W1 frozen between v and h1; the layers above all share W2]
• Then freeze the bottom two layers and relearn all the other layers.

[Figure: W1 and W2 frozen; the layers above all share W3]
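The freeze-and-relearn recipe above amounts to greedy layer-wise pretraining: train one RBM, freeze it, feed its feature activations to the next RBM as if they were pixels. A sketch with illustrative sizes (CD-1 inside, biases omitted; all names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """Train one RBM with CD-1 and return its weight matrix (biases omitted)."""
    W = rng.normal(0, 0.01, (data.shape[1], n_hidden))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            v1 = (rng.random(data.shape[1]) < sigmoid(h0 @ W.T)).astype(float)
            ph1 = sigmoid(v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    return W

def greedy_pretrain(data, layer_sizes):
    """Freeze each trained layer, then treat its feature activations
    as the 'pixels' for the next layer."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)           # activations become the next layer's data
    return weights

data = rng.integers(0, 2, (20, 8)).astype(float)
stack = greedy_pretrain(data, [6, 4])   # a 2-layer stack on random toy data
```

Each RBM in the stack only ever sees the layer below it, which is what makes the procedure greedy and fast compared with training the whole deep network jointly.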
The generative model after learning 3 layers
• To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
2. Perform a top-down pass to get states for all the other layers.
• So the lower-level bottom-up connections are not part of the generative model.

[Figure: data, h1, h2, h3 with generative weights W1, W2, W3; the top two layers form the RBM]
A neural model of digit recognition
[Figure: 28 × 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons, which also connect to 10 label neurons]

The model learns to generate combinations of labels and images. To perform recognition we start with a neutral state of the label units and do an up-pass from the image, followed by one or two iterations of the top-level associative memory. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
Fine-tuning with the up-down algorithm: A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM.
– This eliminates explaining away at the top level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in the vicinity of the data.
– It also reduces mode averaging. If the recognition weights prefer one mode, they will stick with that mode even if the generative weights like some other mode just as much.
Examples of correctly recognized handwritten digits that the neural network had never seen before

It’s very good.
How well does it discriminate on MNIST test set with no extra information about geometric distortions?
• Up-down net with RBM pre-training + CD10: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): 1.5%
• Backprop with 500 → 300 hiddens: 1.5%
• K-Nearest Neighbor: ~3.3%
• It’s better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal and the teacher can be another sensory input.
Seeing what it is thinking
• The top-level associative memory has activities over thousands of neurons.
– It is hard to tell what the network is thinking by looking at the patterns of activation.
• To see what it is thinking, convert the top-level representation into an image by using the top-down connections.
– A mental state is the state of a hypothetical world in which the internal representation is correct.
[Figure: brain scan showing the extra activation of cortex caused by a speech task. What were they thinking?]
What goes on in its mind if we show it an image composed of random pixels and ask it to fantasize from there?

[Figure: alternating “mind” and “brain” states during fantasies, shown on the network of 28 × 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons with 10 label neurons]
Samples generated by running the associative memory with one label clamped. Initialized by an up-pass from a random binary image. 20 iterations between samples.
Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.
[Figure: the same architecture — 28 × 28 pixel image → 500 units → 500 units → 2000 top-level units, with 10 label units]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous digit between “six” and “five” — the original visual input flanked by its reconstruction under each auditory label]
And now for something a bit more realistic
• Handwritten digits are convenient for research into shape recognition, but natural images of outdoor scenes are much more complicated.
– If we train a network on patches from natural images, does it produce sets of features that look like the ones found in real brains?
A network with local connectivity
[Figure: an image layer with global connectivity to the first hidden layer and local connectivity between the two hidden layers]

The local connectivity between the two hidden layers induces a topography on the hidden units.
Features learned by a net that sees 100,000 patches of natural images.
The feature neurons are locally connected to each other.
All 125 errors. The best other machine learning technique gets 140 errors on this task. (These results are without added distortions or geometric prior knowledge.)
A simple learning algorithm
[Figure: data i–j and reconstruction i–j]

$$\Delta w_{ij} = \varepsilon\,(\langle s_i s_j \rangle^{\text{data}} - \langle s_i s_j \rangle^{\text{recon}})$$

where $\varepsilon$ is the learning rate. Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. “Visible” neurons represent pixels; “hidden” neurons represent features.
Types of connectivity
• Feed-forward networks
– These compute a series of transformations.
– Typically, the first layer is the input and the last layer is the output.
• Recurrent networks
– These have directed cycles in their connection graph. They can have complicated dynamics.
– More biologically realistic.

[Figure: input units → hidden units → output units]
A simple learning algorithm
[Figure: data i–j and reconstruction i–j]

1. Start with a training image on the visible neurons.
2. Pick binary states for the hidden neurons. Increment weights between active pairs.
3. Then pick binary states for the visible neurons.
4. Then pick binary states for the hidden neurons again. Decrement weights between active pairs.

This changes the weights to increase the goodness of the data and decrease the goodness of the reconstruction. “Visible” neurons represent pixels; “hidden” neurons represent features.
A more interesting type of network
• Several politicians can be elected at the same time.
– A politician corresponds to a feature, and each familiar object corresponds to a coalition of features.
• The politicians decide who can vote (like in Florida).
• The whole system of voters and politicians can have several different stable states.
– Conservative politicians discourage liberal voters from voting.
– Liberal politicians discourage conservative voters from voting.
• If we add some noise to the voting process, the system will occasionally jump from one regime to another.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron’s bias, b, and the input it receives from other neurons.
$$p(s_i = 1) = \frac{1}{1 + \exp\!\big({-b_i - \sum_j s_j w_{ji}}\big)}$$

(The logistic rises from 0 through 0.5 towards 1 as the total input increases.)
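The formula above translates directly into code. A minimal sketch with made-up states and weights for three input neurons:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def neuron_state(b_i, s, w_i):
    """p(s_i = 1) = 1 / (1 + exp(-b_i - sum_j s_j w_ji)); sample the binary state."""
    p = sigmoid(b_i + s @ w_i)
    return int(rng.random() < p), p

s = np.array([1.0, 0.0, 1.0])        # states of three other neurons (illustrative)
w = np.array([2.0, -1.0, 0.5])       # weights from those neurons (illustrative)
state, p = neuron_state(-1.0, s, w)  # total input: -1 + 2 + 0.5 = 1.5
```

Because the state is a sample rather than the probability itself, repeatedly updating a network of such neurons yields the stochastic dynamics used for Gibbs sampling earlier in the talk.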
How the brain works
• Each neuron receives inputs from thousands of other neurons.
– A few neurons also get inputs from the sensory receptors.
– A few neurons send outputs to muscles.
– Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
– The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
– Recognizing objects, understanding language, making plans, controlling the body.