Posted 14-Mar-2020
1
Neurons and neural networks II. Hopfield network
2
Perceptron recap
• key ingredient: adaptivity of the system
• unsupervised vs supervised learning
• architecture for discrimination: single neuron, the perceptron
• error function & learning rule
• gradient descent learning & divergence
• regularisation
• learning as inference
3
Interpreting learning as inference
So far: optimization with respect to an objective function, M(w) = G(w) + α E_W(w), where G(w) is the data error and E_W(w) is the regularizer.
What's this quirky regularizer, anyway?
4
Interpreting learning as inference
Let's interpret y(x, w) as a probability: P(t = 1 | x, w) = y(x, w).
In a compact form: P(t | x, w) = y^t (1 - y)^(1-t).
The likelihood of the data can then be expressed with the original error function: P(D | w) = exp(-G(w)).
The regularizer has the form of a prior: P(w) ∝ exp(-α E_W(w)).
What we get in the objective function M(w) is the negative log of the posterior distribution of w (up to an additive constant).
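The correspondence can be made concrete in a short sketch. Assuming a logistic output y(x, w) = σ(w·x), a cross-entropy data error G(w) and a quadratic regularizer E_W(w) (standard choices; the slide's exact forms are not shown), the objective M(w) is the negative log posterior up to a constant:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(w, X, t, alpha):
    """M(w) = G(w) + alpha * E_W(w): negative log posterior up to a constant.

    G(w)   = -sum_n [t_n log y_n + (1 - t_n) log(1 - y_n)]  (negative log likelihood)
    E_W(w) = 0.5 * ||w||^2                                  (negative log Gaussian prior)
    """
    y = sigmoid(X @ w)
    G = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    E_W = 0.5 * np.dot(w, w)
    return G + alpha * E_W
```

Minimizing M(w) is then the same as maximizing the posterior over w.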
5
Interpreting learning as inference: Relationship between M(w) and the posterior
Interpretation: minimizing M(w) leads to finding the maximum a posteriori estimate, w_MP.
The log-probability interpretation of the objective function retains the additivity of errors while keeping the multiplicativity of probabilities.
6
Interpreting learning as inference: Properties of the Bayesian estimate
• The probabilistic interpretation makes our assumptions explicit: by the regularizer we imposed a soft constraint on the learned parameters, which expresses our prior expectations.
• An additional plus: beyond getting w_MP, we get a measure of the uncertainty in the learned parameters.
7
Interpreting learning as inference: Demo
8
Interpreting learning as inference: Making predictions
Up to this point the goal was optimization: predictions are made with the single best parameter setting, w_MP.
Are we equally confident in the two predictions?
The Bayesian answer exploits the probabilistic interpretation: average the predictions over the posterior distribution of w.
9
Interpreting learning as inference: Calculating Bayesian predictions
Predictive probability: P(t | x, D) = ∫ P(t | x, w) P(w | D) dw
Likelihood: P(D | w) = exp(-G(w))
Weight posterior: P(w | D) = P(D | w) P(w) / P(D)
Partition function: the normalizing constant of the posterior
Finally: the predictive probability is the network's output averaged over the posterior of w.
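In standard (MacKay-style) notation, reconstructed here since the slide's formulas are images, the pieces fit together as:

```latex
P(D \mid \mathbf{w}) = e^{-G(\mathbf{w})}, \qquad
P(\mathbf{w}) = \frac{1}{Z_W}\, e^{-\alpha E_W(\mathbf{w})}

P(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w})\, P(\mathbf{w})}{P(D)}
                     = \frac{1}{Z_M}\, e^{-M(\mathbf{w})}, \qquad
Z_M = \int e^{-M(\mathbf{w})}\, d\mathbf{w}

P(t \mid \mathbf{x}, D) = \int P(t \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid D)\, d\mathbf{w}
```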
10
Interpreting learning as inference: Calculating Bayesian predictions
How to solve the integral? Bad news: it is analytically intractable in general, so Monte Carlo integration is needed.
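A minimal Monte Carlo sketch, assuming the logistic-regression objective from the earlier slides (function names and step sizes here are illustrative, not from the slides): sample weights from the posterior with a simple Metropolis walk and average the resulting predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def M(w, X, t, alpha):
    # Objective M(w) = G(w) + alpha * E_W(w): negative log posterior up to a constant.
    y = sigmoid(X @ w)
    eps = 1e-12
    G = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    return G + alpha * 0.5 * np.dot(w, w)

def metropolis_predict(x_new, X, t, alpha, n_steps=5000, step=0.5):
    """Monte Carlo estimate of P(t=1 | x_new, D) = E_{w ~ P(w|D)}[ y(x_new, w) ]."""
    w = np.zeros(X.shape[1])
    m = M(w, X, t, alpha)
    preds = []
    for _ in range(n_steps):
        w_prop = w + step * rng.standard_normal(w.shape)
        m_prop = M(w_prop, X, t, alpha)
        if np.log(rng.random()) < m - m_prop:   # accept with probability exp(m - m_prop)
            w, m = w_prop, m_prop
        preds.append(sigmoid(x_new @ w))
    return np.mean(preds)
```

Because the chain visits many plausible weight vectors, the averaged prediction is pulled toward 0.5 far from the data, unlike the single w_MP prediction.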
11
12
13
Interpreting learning as inference: Calculating Bayesian predictions
Original estimate vs. Bayesian estimate
14
Interpreting learning as inference: Gaussian approximation
The Gaussian approximation: Taylor-expand M(w) around the MAP estimate w_MP and keep terms up to second order. This yields a Gaussian posterior approximation with mean w_MP and covariance A^-1, where A = ∇∇M(w) evaluated at w_MP.
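As a sketch of how this is done in practice (assuming again the logistic likelihood with a Gaussian prior; helper names are ours): find w_MP by gradient descent, then take A as the Hessian of M there, so the posterior is approximated by N(w_MP, A^-1).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_fit(X, t, alpha, n_iter=2000, lr=0.1):
    """Gaussian (Laplace) approximation: P(w|D) ~ N(w_MP, A^-1), A = Hessian of M at w_MP."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):                      # gradient descent to the MAP estimate
        y = sigmoid(X @ w)
        grad = X.T @ (y - t) + alpha * w         # gradient of M(w)
        w -= lr * grad / n
    y = sigmoid(X @ w)
    # Hessian of M at w_MP: sum_n y_n (1 - y_n) x_n x_n^T + alpha * I
    A = (X * (y * (1 - y))[:, None]).T @ X + alpha * np.eye(d)
    return w, np.linalg.inv(A)                   # mean and covariance of the Gaussian
```

The diagonal of the returned covariance gives exactly the per-parameter uncertainty promised on slide 6.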
15
Neural networks: Unsupervised learning
Unsupervised learning: what is it about?
The capacity of a single neuron is limited: certain data sets cannot be learned. So far we used a supervised learning paradigm: a teacher was necessary to teach an input-output relation.
Hopfield networks try to cure both.
Hebb rule: an enlightening example. Assuming 2 neurons and a weight modification process in which the connection is strengthened when the two neurons are active together (Δw ∝ pre-synaptic activity × post-synaptic activity).
This simple rule realizes an associative memory!
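The Hebb rule and the associative memory it yields can be sketched in a few lines (a miniature Hopfield network; the patterns below are illustrative):

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebb rule: w_ij grows when units i and j are co-active (outer-product storage)."""
    P = np.array(patterns, dtype=float)          # rows are +1/-1 patterns
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)                     # no self-connections
    return W

def recall(W, s, n_steps=10):
    """Synchronous updates: each unit takes the sign of its weighted input."""
    s = np.array(s, dtype=float)
    for _ in range(n_steps):
        s = np.sign(W @ s)
        s[s == 0] = 1.0
    return s

patterns = [[1, -1, 1, -1, 1, -1, 1, -1],
            [1, 1, 1, 1, -1, -1, -1, -1]]
W = hebbian_weights(patterns)
noisy = [1, -1, 1, -1, 1, -1, 1, 1]              # first pattern with one bit flipped
```

Starting from a corrupted cue, the updates pull the state back to the stored pattern: the memory is addressed by content, not by location.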
Reasoning, deduction & the nervous system: Turing machine
• there are systems performing simple computations
• universality can be reached by combining these computations
• all universal systems can perform the same computations
An arbitrary programming language can be used to code all the programs.
Reasoning, deduction & the nervous system
Walter Pitts, 1923-1969
McCulloch-Pitts neuron model
McCulloch-Pitts neuron model: the unit has two states, on and off, and switches on when the weighted sum of its inputs reaches a threshold. With a large w a single input can drive the output; with a small w its influence is negligible. This gives a logical calculus of the brain.
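A McCulloch-Pitts unit is easy to state in code; the slide's point about large versus small w becomes the choice of weights and threshold (the AND/OR mapping below is the classic textbook example):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum of binary inputs
    reaches the threshold; otherwise stays off (0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Logical AND: both inputs are needed to reach the threshold
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=2)
# Logical OR: either input alone is enough to fire the unit
OR = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=1)
```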
Sweet child of ours
Perceptron (Frank Rosenblatt): a "remarkable machine…[was] capable of what amounts to thought."
The New Yorker, December 6, 1958, p. 44
Linear discrimination
• The perceptron can learn a linear subspace for discrimination
graphically show the difference between a 'good' and 'bad' representation for directly supporting object recognition. The representation in Figure 1b is good: it is easy to determine if Joe is present, in spite of pose variation, by simply placing the linear decision function (i.e. a hyperplane) between Joe's manifold and the other potential images in the visual world (just images of Sam in this case, but see Figure I in Box 2). By contrast, the representation in Figure 1c is bad: the object manifolds are tangled, such that it is impossible to reliably separate Joe from the rest of the visual world with a linear decision function. Figure 1d shows that this problem is not academic – the manifolds of two real-world objects are hopelessly tangled together in the retinal representation.
Note, however, that the two manifolds in Figure 1c,d do not cross or superimpose – they are like two sheets of paper crumpled together. This means that, although the retinal representation cannot directly support recognition, it implicitly contains the information to distinguish which of the two individuals was seen. We argue that this describes the computational crux of 'everyday' recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation – it is tangled (but also see Box 1). Although Figure 1 shows only two objects, the same arguments apply when more objects are in the world of possible objects – it just makes the problem harder, but for exactly the same reasons.
One way of viewing the overarching goal of the brain's object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded (as in Figure 1c,d), into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded (e.g. Figure 1b). Although the idea of representational transformation has been stated under
Figure 1. Illustration of object tangling. In a neuronal population space, each cardinal axis is one neuron's activity (e.g. firing rate over an ≈200 ms interval) and the dimensionality of the space is equal to the number of neurons. Although such high-dimensional spaces cannot be visualized, the three-dimensional views portrayed here provide fundamental insight. (a) A given image of a single object (here, a particular face) is one point in retinal image space. As the face's pose is varied, the point travels along curved paths in the space, and all combinations of left/right and up/down pose (two degrees of freedom) lie on a two-dimensional surface, called the object manifold (in blue). Although only two degrees of freedom are shown for clarity, the same idea applies when other identity-preserving transformations (e.g. size, position) are applied. (b) The manifolds of two objects (two faces, red and blue) are shown in a common neuronal population space. In this case, a decision (hyper-)plane can be drawn cleanly between them. If the world only consisted of this set of images, this neuronal representation would be 'good' for supporting visual recognition. (c) In this case, the two object manifolds are intertwined, or tangled. A decision plane can no longer separate the manifolds, no matter how it is tipped or translated. (d) Pixel (retina-like) manifolds generated from actual models of faces (14,400-dimensional data; 120 × 120 images) for two face objects were generated from mild variation in their pose, position, scale and lighting (for clarity, only the pose-induced portion of the manifold is displayed). The three-dimensional display axes were chosen to be the projections that best separate identity, pose azimuth and pose elevation. Even though this simple example only exercises a fraction of typical real-world variation, the object manifolds are hopelessly tangled. Although the manifolds appear to cross in this three-dimensional projection, they do not cross in the high-dimensional space in which they live.
Opinion, TRENDS in Cognitive Sciences Vol. 11 No. 8, p. 335. www.sciencedirect.com
Universal function approximation
[VS01CH17-Kriegeskorte, ARI, 4 November 2015]
Figure 2, panels a-c: inputs x1 and x2 feed a hidden layer y1 through weights W1, which feeds an output layer y2 through weights W2.
With nonlinear hidden units: y2 = f(f(x W1) W2). With a linear hidden layer the network collapses: y2 = x W1 W2 = x W'.
Figure 2. Networks with nonlinear hidden units can approximate arbitrary nonlinear functions. (a) A feedforward neural network with a single hidden layer. (b) Activation of the pink and blue hidden units as a function of the input pattern (x1, x2) when the hidden units have linear activation functions. Each output unit (y2) will compute a weighted combination of the ramp-shaped (i.e., linear) activations of the hidden units. Thus, the output remains a linear combination of the input pattern. A linear hidden layer is not useful because the resulting network is equivalent to a linear network without a hidden layer intervening between input and output. (c) Activation of the pink and blue hidden units when these have sigmoid activation functions. Arbitrary continuous functions can be approximated in the output units (y2) by weighted combinations of a sufficient number of nonlinear hidden-unit outputs (y1).
Universal function approximator: model family that can approximate any function that maps input patterns to output patterns (with arbitrary precision when allowed enough parameters)
ramp functions, and thus itself computes a ramp function. A multilayer network of linear units is equivalent to a single-layer network whose weights matrix W′ is the product of the weights matrices Wi of the multilayer network. Nonlinear units are essential because their outputs provide building blocks (Figure 2c) whose linear combination one level up enables us to approximate any desired mapping from inputs to outputs, as described in the next section.
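The claim that a multilayer linear network collapses to a single weight matrix can be checked numerically; a minimal sketch (all dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with weight matrices W1 and W2
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

# The equivalent single-layer weight matrix W' is their product
W_prime = W1 @ W2

x = rng.normal(size=(10, 3))    # a batch of input patterns
y_deep = (x @ W1) @ W2          # two-layer linear network
y_shallow = x @ W_prime         # single linear layer

# Matrix multiplication is associative, so the two outputs agree
assert np.allclose(y_deep, y_shallow)
```

The same collapse happens for any number of stacked linear layers, which is why nonlinear activation functions are needed for the hidden layer to add representational power.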
A unit in a neural network uses its input weights w to compute a weighted sum z of its input activities x and passes the result through a (typically monotonic) nonlinear function f to generate its activation y (Figure 1a). In early models, the nonlinearity was simply a step function (McCulloch & Pitts 1943, Rosenblatt 1958, Minsky & Papert 1972), making each unit a linear discriminant imposing a binary threshold. For a single threshold unit, the perceptron learning algorithm provides a method for iteratively adjusting the weights (starting with zeros or random weights) so as to get as many training input–output pairs as possible right. However, hard thresholding entails that, for a given pair of an input pattern and a desired output pattern, small changes to the weights will often make no difference to the output. This makes it difficult to learn the weights for a multilayer network by gradient descent, where small adjustments to the weights are made to iteratively reduce the errors. If the hard threshold is replaced by a soft threshold that continuously varies, such as a sigmoid function, gradient descent can be used for learning.
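The difference between hard and soft thresholds can be made concrete. In this sketch (input, weights, and bias are arbitrary illustrative values), a small weight perturbation leaves a step unit's output unchanged, giving gradient descent nothing to work with, while a sigmoid unit's output moves smoothly:

```python
import numpy as np

def unit(x, w, b, f):
    """A model unit: weighted sum z = b + x . w passed through activation f."""
    z = b + np.dot(x, w)
    return f(z)

def step(z):
    return 1.0 if z >= 0 else 0.0      # hard threshold

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # soft threshold

x = np.array([0.5, -1.0])              # arbitrary input pattern
w = np.array([1.0, 2.0])               # arbitrary weights
b = 0.1

# A small weight change leaves the hard-threshold output unchanged
# (zero gradient), while the sigmoid output changes smoothly.
y_step_before = unit(x, w, b, step)
y_step_after = unit(x, w + 0.01, b, step)

y_sig_before = unit(x, w, b, sigmoid)
y_sig_after = unit(x, w + 0.01, b, sigmoid)

assert y_step_before == y_step_after
assert y_sig_before != y_sig_after
```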
Networks with Nonlinear Hidden Units Are Universal Function Approximators

The particular shape of the nonlinear activation function does not matter to the class of input–output mappings that can be represented. Feedforward networks with at least one layer of hidden units intervening between input and output layers are universal function approximators: Given a sufficient number of hidden units, a network can approximate any function of the inputs in the output units. Continuous functions can be approximated with arbitrary precision by adding a sufficient number of hidden units and suitably setting the weights (Schafer & Zimmermann 2007, Hornik 1991, Cybenko 1989). Figure 2c illustrates this process for two-dimensional inputs:
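A toy numerical illustration of this approximation property: a hidden layer of randomly placed sigmoid units plus a linear read-out fitted by least squares (a simplification of real training, where the hidden weights would be learned too; the target function and all sizes are arbitrary choices) already matches a smooth target closely:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target function to approximate on [-3, 3]
x = np.linspace(-3, 3, 200)[:, None]
target = np.sin(2 * x).ravel()

# Hidden layer: 100 sigmoid units with randomly placed transitions
# (for simplicity, only the output weights are fitted below)
H = 100
w_in = rng.normal(scale=3.0, size=(1, H))
b_in = rng.normal(scale=3.0, size=H)
hidden = 1.0 / (1.0 + np.exp(-(x @ w_in + b_in)))

# Linear read-out: best output weights in the least-squares sense
w_out, *_ = np.linalg.lstsq(hidden, target, rcond=None)
approx = hidden @ w_out

max_err = np.max(np.abs(approx - target))   # small worst-case error
```

Adding more hidden units drives the approximation error down further, which is the content of the universal approximation theorems cited above.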
Annu. Rev. Vis. Sci. 2015. 1:417–446. Downloaded from www.annualreviews.org. Access provided by Stanford University - Main Campus - Robert Crown Law Library on 07/26/16. For personal use only.
• A multi-layer neural network can combine discrimination subspaces
• A multi-layer perceptron is a universal function approximator — albeit not necessarily an efficient one
Deep networks as universal function approximators
[Figure 1 panels: (a) a model unit with inputs x1, x2, weights w1, w2 and bias b, computing z = b + Σi xi wi, with linear, threshold, sigmoid and rectified linear activation functions plotted; (b) shallow feedforward network (1 hidden layer); (c) deep feedforward network (>1 hidden layer); (d) recurrent network; each with input, hidden and output layers.]
Figure 1. Artificial neural networks: basic units and architectures. (a) A typical model unit (left) computes a linear combination z of its inputs xi using weights wi and adding a bias b. The output y of the unit is a function of z, known as the activation function (right). Popular activation functions include linear (gray), threshold (black), sigmoid (hyperbolic tangent shown here, blue), and rectified linear (red) functions. A network is referred to as feedforward (b,c) when its directed connections do not form cycles and as recurrent (d) when they do form cycles. A shallow feedforward network (b) has zero or one hidden layers. Nonlinear activation functions in hidden units enable a shallow feedforward network to approximate any continuous function (with the precision depending on the number of hidden units). A deep feedforward network (c) has more than one hidden layer. Recurrent nets generate ongoing dynamics, lend themselves to the processing of temporal sequences of inputs, and can approximate any dynamical system (given a sufficient number of units).
critical arguments, upcoming challenges, and the way ahead toward empirically justified models of complex biological brain information processing.
A PRIMER ON NEURAL NETWORKS
A Unit Computes a Weighted Sum of Its Inputs and Activates According to a Nonlinear Function
We refer to model neurons as units to maintain a distinction between biological reality and highly abstracted models. The perhaps simplest model unit is a linear unit, which outputs a linear combination of its inputs (Figure 1a). Such units, combined to form networks, can never transcend linear combinations of the inputs. This insight is illustrated in Figure 2b, which shows how an output unit that linearly combines intermediate-layer linear-unit activations just adds up
16

Neural networks The Hopfield network

Architecture: a set of I neurons connected by symmetric synapses of weight wij (wij = wji); no self-connections: wii = 0; output of neuron i: xi ∈ {−1, +1}

Activity rule: each neuron computes its activation ai = Σj wij xj and sets its output to xi = Θ(ai), where Θ(a) = +1 if a ≥ 0 and −1 otherwise

Synchronous/asynchronous update: either all neurons update at once, or one (e.g. randomly chosen) neuron updates at a time

Learning rule: to memorize a set of patterns {x(n)}, use the Hebb rule wij = η Σn xi(n) xj(n), e.g. with η = 1/I

alternatively, a continuous network can be defined by replacing the threshold with xi = tanh(ai)
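The architecture, Hebbian learning rule, and asynchronous activity rule above can be put together in a short sketch (the network size, number of memories, corruption level, and update count are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

I = 100                                       # number of neurons
patterns = rng.choice([-1, 1], size=(3, I))   # memories x^(n) in {-1, +1}

# Hebbian learning rule: w_ij = (1/I) sum_n x_i^(n) x_j^(n), with w_ii = 0
W = (patterns.T @ patterns) / I
np.fill_diagonal(W, 0.0)

def recall(x0, steps=2000):
    """Asynchronous activity rule: repeatedly pick a random neuron i and
    set x_i = +1 if a_i = sum_j w_ij x_j >= 0, else -1."""
    x = x0.copy()
    for _ in range(steps):
        i = rng.integers(I)
        x[i] = 1 if W[i] @ x >= 0 else -1
    return x

# Probe the network with a corrupted memory: flip 10 of the 100 bits
probe = patterns[0].copy()
probe[rng.choice(I, size=10, replace=False)] *= -1

restored = recall(probe)
overlap = np.mean(restored == patterns[0])    # fraction of bits recovered
```

With only a few stored patterns, the dynamics typically pull the corrupted probe back to the stored memory, which is the content-addressable-memory behavior the slides describe.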
17

Neural networks Stability of Hopfield network

Are the memories stable?

Necessary conditions: symmetric weights; asynchronous update

Robust against perturbation of a subset of weights
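The reason these conditions matter: with symmetric weights, no self-connections, and asynchronous updates, the network has an energy (Lyapunov) function E(x) = −½ Σij wij xi xj that an update can never increase, so the dynamics must settle into fixed points, and stored memories sit at (local) minima. A sketch verifying monotonicity numerically (sizes and step count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

I = 50
patterns = rng.choice([-1, 1], size=(2, I))
W = (patterns.T @ patterns) / I      # symmetric by construction
np.fill_diagonal(W, 0.0)             # no self-connections

def energy(x):
    # Hopfield energy: E(x) = -1/2 * sum_ij w_ij x_i x_j
    return -0.5 * x @ W @ x

x = rng.choice([-1, 1], size=I)      # arbitrary initial state
energies = [energy(x)]
for _ in range(500):
    i = rng.integers(I)              # one asynchronous update
    x[i] = 1 if W[i] @ x >= 0 else -1
    energies.append(energy(x))

# Flipping x_i to the sign of a_i changes E by -(x_i_new - x_i_old) * a_i <= 0
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(energies, energies[1:]))
```

With synchronous updates this guarantee is lost: the network can enter two-state oscillations instead of settling.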
18
Neural networks Capacity of Hopfield network
How many traces can be memorized by a network of I neurons?
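For random ±1 patterns, the classic answer is that reliable recall breaks down at roughly N ≈ 0.138·I stored patterns (Amit, Gutfreund & Sompolinsky). A rough numerical probe of this (all sizes arbitrary) is a one-step stability check: store N patterns and ask what fraction of stored bits are left unchanged by the activity rule:

```python
import numpy as np

def fraction_stable(I, N, seed=0):
    """Store N random patterns in I neurons via the Hebb rule and return the
    fraction of stored bits unchanged by one application of the activity rule
    (a rough proxy for successful memorization)."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1, 1], size=(N, I))
    W = (patterns.T @ patterns) / I
    np.fill_diagonal(W, 0.0)
    a = patterns @ W                 # activations at each stored pattern (W symmetric)
    return np.mean(np.where(a >= 0, 1, -1) == patterns)

low_load = fraction_stable(I=200, N=10)    # N/I = 0.05: essentially all bits stable
high_load = fraction_stable(I=200, N=60)   # N/I = 0.30: noticeably degraded
```

As the load N/I is pushed past the capacity limit, crosstalk between the stored patterns overwhelms the signal term and an increasing fraction of bits flips immediately, so the patterns stop being fixed points of the dynamics.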
19
Neural networks Capacity of Hopfield network