Posted 14-Mar-2020
1
Neurons and neural networks II. Hopfield network
2
Perceptron recap
• key ingredient: adaptivity of the system
• unsupervised vs supervised learning
• architecture for discrimination: single neuron, the perceptron
• error function & learning rule
• gradient descent learning & divergence
• regularisation
• learning as inference
3
Interpreting learning as inference
So far: optimization with respect to an objective function, M(w) = G(w) + α E_W(w), where G(w) is the data error and E_W(w) is the regularizer.
What's this quirky regularizer, anyway?
4
Interpreting learning as inference
Let's interpret y(x, w) as a probability: P(t = 1 | x, w) = y(x, w).
In a compact form: P(t | x, w) = y^t (1 - y)^(1-t).
The likelihood of the data can then be expressed with the original error function: P(D | w) = exp(-G(w)).
The regularizer has the form of a prior: P(w) ∝ exp(-α E_W(w)).
What we get in the objective function M(w) is the negative log of the posterior distribution of w (up to an additive constant).
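The correspondence can be made concrete in a short sketch. Assuming a logistic output y(x, w) = σ(w·x), a cross-entropy data error G(w) and a quadratic regularizer E_W(w) (standard choices; the slide's exact forms are not shown), the objective M(w) is the negative log posterior up to a constant:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(w, X, t, alpha):
    """M(w) = G(w) + alpha * E_W(w): negative log posterior up to a constant.

    G(w)   = -sum_n [t_n log y_n + (1 - t_n) log(1 - y_n)]  (negative log likelihood)
    E_W(w) = 0.5 * ||w||^2                                  (negative log Gaussian prior)
    """
    y = sigmoid(X @ w)
    G = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    E_W = 0.5 * np.dot(w, w)
    return G + alpha * E_W
```

Minimizing M(w) is then the same as maximizing the posterior over w.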
5
Interpreting learning as inference: Relationship between M(w) and the posterior
Interpretation: minimizing M(w) leads to finding the maximum a posteriori estimate, w_MP.
The log-probability interpretation of the objective function retains the additivity of errors while keeping the multiplicativity of probabilities.
6
Interpreting learning as inference: Properties of the Bayesian estimate
• The probabilistic interpretation makes our assumptions explicit: by the regularizer we imposed a soft constraint on the learned parameters, which expresses our prior expectations.
• An additional plus: beyond getting w_MP, we get a measure of the uncertainty in the learned parameters.
7
Interpreting learning as inference: Demo
8
Interpreting learning as inference: Making predictions
Up to this point the goal was optimization: predictions are made with the single best parameter setting, w_MP.
Are we equally confident in the two predictions?
The Bayesian answer exploits the probabilistic interpretation: average the predictions over the posterior distribution of w.
9
Interpreting learning as inference: Calculating Bayesian predictions
Predictive probability: P(t | x, D) = ∫ P(t | x, w) P(w | D) dw
Likelihood: P(D | w) = exp(-G(w))
Weight posterior: P(w | D) = P(D | w) P(w) / P(D)
Partition function: the normalizing constant of the posterior
Finally: the predictive probability is the network's output averaged over the posterior of w.
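In standard (MacKay-style) notation, reconstructed here since the slide's formulas are images, the pieces fit together as:

```latex
P(D \mid \mathbf{w}) = e^{-G(\mathbf{w})}, \qquad
P(\mathbf{w}) = \frac{1}{Z_W}\, e^{-\alpha E_W(\mathbf{w})}

P(\mathbf{w} \mid D) = \frac{P(D \mid \mathbf{w})\, P(\mathbf{w})}{P(D)}
                     = \frac{1}{Z_M}\, e^{-M(\mathbf{w})}, \qquad
Z_M = \int e^{-M(\mathbf{w})}\, d\mathbf{w}

P(t \mid \mathbf{x}, D) = \int P(t \mid \mathbf{x}, \mathbf{w})\, P(\mathbf{w} \mid D)\, d\mathbf{w}
```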
10
Interpreting learning as inference: Calculating Bayesian predictions
How to solve the integral? Bad news: it is analytically intractable in general, so Monte Carlo integration is needed.
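A minimal Monte Carlo sketch, assuming the logistic-regression objective from the earlier slides (function names and step sizes here are illustrative, not from the slides): sample weights from the posterior with a simple Metropolis walk and average the resulting predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def M(w, X, t, alpha):
    # Objective M(w) = G(w) + alpha * E_W(w): negative log posterior up to a constant.
    y = sigmoid(X @ w)
    eps = 1e-12
    G = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    return G + alpha * 0.5 * np.dot(w, w)

def metropolis_predict(x_new, X, t, alpha, n_steps=5000, step=0.5):
    """Monte Carlo estimate of P(t=1 | x_new, D) = E_{w ~ P(w|D)}[ y(x_new, w) ]."""
    w = np.zeros(X.shape[1])
    m = M(w, X, t, alpha)
    preds = []
    for _ in range(n_steps):
        w_prop = w + step * rng.standard_normal(w.shape)
        m_prop = M(w_prop, X, t, alpha)
        if np.log(rng.random()) < m - m_prop:   # accept with probability exp(m - m_prop)
            w, m = w_prop, m_prop
        preds.append(sigmoid(x_new @ w))
    return np.mean(preds)
```

Because the chain visits many plausible weight vectors, the averaged prediction is pulled toward 0.5 far from the data, unlike the single w_MP prediction.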
11
12
13
Interpreting learning as inference: Calculating Bayesian predictions
Original estimate vs. Bayesian estimate
14
Interpreting learning as inference: Gaussian approximation
The Gaussian approximation: Taylor-expand M(w) around the MAP estimate w_MP and keep terms up to second order. This yields a Gaussian posterior approximation with mean w_MP and covariance A^-1, where A = ∇∇M(w) evaluated at w_MP.
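As a sketch of how this is done in practice (assuming again the logistic likelihood with a Gaussian prior; helper names are ours): find w_MP by gradient descent, then take A as the Hessian of M there, so the posterior is approximated by N(w_MP, A^-1).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_fit(X, t, alpha, n_iter=2000, lr=0.1):
    """Gaussian (Laplace) approximation: P(w|D) ~ N(w_MP, A^-1), A = Hessian of M at w_MP."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):                      # gradient descent to the MAP estimate
        y = sigmoid(X @ w)
        grad = X.T @ (y - t) + alpha * w         # gradient of M(w)
        w -= lr * grad / n
    y = sigmoid(X @ w)
    # Hessian of M at w_MP: sum_n y_n (1 - y_n) x_n x_n^T + alpha * I
    A = (X * (y * (1 - y))[:, None]).T @ X + alpha * np.eye(d)
    return w, np.linalg.inv(A)                   # mean and covariance of the Gaussian
```

The diagonal of the returned covariance gives exactly the per-parameter uncertainty promised on slide 6.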
15
Neural networks: Unsupervised learning
Unsupervised learning: what is it about?
The capacity of a single neuron is limited: certain data sets cannot be learned. So far we used a supervised learning paradigm: a teacher was necessary to teach an input-output relation.
Hopfield networks try to cure both.
Hebb rule: an enlightening example. Assuming 2 neurons and a weight modification process in which the connection is strengthened when the two neurons are active together (Δw ∝ pre-synaptic activity × post-synaptic activity).
This simple rule realizes an associative memory!
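The Hebb rule and the associative memory it yields can be sketched in a few lines (a miniature Hopfield network; the patterns below are illustrative):

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebb rule: w_ij grows when units i and j are co-active (outer-product storage)."""
    P = np.array(patterns, dtype=float)          # rows are +1/-1 patterns
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)                     # no self-connections
    return W

def recall(W, s, n_steps=10):
    """Synchronous updates: each unit takes the sign of its weighted input."""
    s = np.array(s, dtype=float)
    for _ in range(n_steps):
        s = np.sign(W @ s)
        s[s == 0] = 1.0
    return s

patterns = [[1, -1, 1, -1, 1, -1, 1, -1],
            [1, 1, 1, 1, -1, -1, -1, -1]]
W = hebbian_weights(patterns)
noisy = [1, -1, 1, -1, 1, -1, 1, 1]              # first pattern with one bit flipped
```

Starting from a corrupted cue, the updates pull the state back to the stored pattern: the memory is addressed by content, not by location.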
Reasoning, deduction & the nervous system: Turing machine
• there are systems performing simple computations
• universality can be reached by combining these computations
• all universal systems can perform the same computations
An arbitrary programming language can be used to code all the programs.
Reasoning, deduction & the nervous system
Walter Pitts, 1923-1969
McCulloch-Pitts neuron model
McCulloch-Pitts neuron model: the unit has two states, on and off, and switches on when the weighted sum of its inputs reaches a threshold. With a large w a single input can drive the output; with a small w its influence is negligible. This gives a logical calculus of the brain.
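A McCulloch-Pitts unit is easy to state in code; the slide's point about large versus small w becomes the choice of weights and threshold (the AND/OR mapping below is the classic textbook example):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum of binary inputs
    reaches the threshold; otherwise stays off (0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Logical AND: both inputs are needed to reach the threshold
AND = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=2)
# Logical OR: either input alone is enough to fire the unit
OR = lambda x1, x2: mp_neuron([x1, x2], [1, 1], threshold=1)
```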
Sweet child of ours
Perceptron (Frank Rosenblatt): a "remarkable machine…[was] capable of what amounts to thought."
The New Yorker, December 6, 1958, p. 44
Linear discrimination
• The perceptron can learn a linear subspace for discrimination
graphically show the difference between a 'good' and 'bad' representation for directly supporting object recognition. The representation in Figure 1b is good: it is easy to determine if Joe is present, in spite of pose variation, by simply placing the linear decision function (i.e. a hyperplane) between Joe's manifold and the other potential images in the visual world (just images of Sam in this case, but see Figure I in Box 2). By contrast, the representation in Figure 1c is bad: the object manifolds are tangled, such that it is impossible to reliably separate Joe from the rest of the visual world with a linear decision function. Figure 1d shows that this problem is not academic – the manifolds of two real-world objects are hopelessly tangled together in the retinal representation.
Note, however, that the two manifolds in Figure 1c,d do not cross or superimpose – they are like two sheets of paper crumpled together. This means that, although the retinal representation cannot directly support recognition, it implicitly contains the information to distinguish which of the two individuals was seen. We argue that this describes the computational crux of 'everyday' recognition: the problem is typically not a lack of information or noisy information, but that the information is badly formatted in the retinal representation – it is tangled (but also see Box 1). Although Figure 1 shows only two objects, the same arguments apply when more objects are in the world of possible objects – it just makes the problem harder, but for exactly the same reasons.
One way of viewing the overarching goal of the brain's object recognition machinery, then, is as a transformation from visual representations that are easy to build (e.g. center-surround filters in the retina), but are not easily decoded (as in Figure 1c,d), into representations that we do not yet know how to build (e.g. representations in IT), but are easily decoded (e.g. Figure 1b). Although the idea of representational transformation has been stated under
Figure 1. Illustration of object tangling. In a neuronal population space, each cardinal axis is one neuron's activity (e.g. firing rate over an ≈200 ms interval) and the dimensionality of the space is equal to the number of neurons. Although such high-dimensional spaces cannot be visualized, the three-dimensional views portrayed here provide fundamental insight. (a) A given image of a single object (here, a particular face) is one point in retinal image space. As the face's pose is varied, the point travels along curved paths in the space, and all combinations of left/right and up/down pose (two degrees of freedom) lie on a two-dimensional surface, called the object manifold (in blue). Although only two degrees of freedom are shown for clarity, the same idea applies when other identity-preserving transformations (e.g. size, position) are applied. (b) The manifolds of two objects (two faces, red and blue) are shown in a common neuronal population space. In this case, a decision (hyper-)plane can be drawn cleanly between them. If the world only consisted of this set of images, this neuronal representation would be 'good' for supporting visual recognition. (c) In this case, the two object manifolds are intertwined, or tangled. A decision plane can no longer separate the manifolds, no matter how it is tipped or translated. (d) Pixel (retina-like) manifolds generated from actual models of faces (14,400-dimensional data; 120 × 120 images) for two face objects were generated from mild variation in their pose, position, scale and lighting (for clarity, only the pose-induced portion of the manifold is displayed). The three-dimensional display axes were chosen to be the projections that best separate identity, pose azimuth and pose elevation. Even though this simple example only exercises a fraction of typical real-world variation, the object manifolds are hopelessly tangled. Although the manifolds appear to cross in this three-dimensional projection, they do not cross in the high-dimensional space in which they live.
Opinion, TRENDS in Cognitive Sciences Vol. 11 No. 8, p. 335. www.sciencedirect.com
Universal function approximation
[VS01CH17-Kriegeskorte, ARI, 4 November 2015]
Figure 2, panels a-c: inputs x1 and x2 feed a hidden layer y1 through weights W1, which feeds an output layer y2 through weights W2.
With nonlinear hidden units: y2 = f(f(x W1) W2). With a linear hidden layer the network collapses: y2 = x W1 W2 = x W'.
Figure 2. Networks with nonlinear hidden units can approximate arbitrary nonlinear functions. (a) A feedforward neural network with a single hidden layer. (b) Activation of the pink and blue hidden units as a function of the input pattern (x1, x2) when the hidden units have linear activation functions. Each output unit (y2) will compute a weighted combination of the ramp-shaped (i.e., linear) activations of the hidden units. Thus, the output remains a linear combination of the input pattern. A linear hidden layer is not useful because the resulting network is equivalent to a linear network without a hidden layer intervening between input and output. (c) Activation of the pink and blue hidden units when these have sigmoid activation functions. Arbitrary continuous functions can be approximated in the output units (y2) by weighted combinations of a sufficient number of nonlinear hidden-unit outputs (y1).
Universal function approximator: model family that can approximate any function that maps input patterns to output patterns (with arbitrary precision when allowed enough parameters)
ramp functions, and thus itself computes a ramp function. A multilayer network of linear units is equivalent to a single-layer network whose weights matrix W′ is the product of the weights matrices Wi of the multilayer network. Nonlinear units are essential because their outputs provide building blocks (Figure 2c) whose linear combination one level up enables us to approximate any desired mapping from inputs to outputs, as described in the next section.
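The claim that a multilayer linear network collapses to a single weight matrix can be checked numerically; a minimal sketch (all dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with weight matrices W1 and W2
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

# The equivalent single-layer weight matrix W' is their product
W_prime = W1 @ W2

x = rng.normal(size=(10, 3))    # a batch of input patterns
y_deep = (x @ W1) @ W2          # two-layer linear network
y_shallow = x @ W_prime         # single linear layer

# Matrix multiplication is associative, so the two outputs agree
assert np.allclose(y_deep, y_shallow)
```

The same collapse happens for any number of stacked linear layers, which is why nonlinear activation functions are needed for the hidden layer to add representational power.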
A unit in a neural network uses its input weights w to compute a weighted sum z of its input activities x and passes the result through a (typically monotonic) nonlinear function f to generate its activation y (Figure 1a). In early models, the nonlinearity was simply a step function (McCulloch & Pitts 1943, Rosenblatt 1958, Minsky & Papert 1972), making each unit a linear discriminant imposing a binary threshold. For a single threshold unit, the perceptron learning algorithm provides a method for iteratively adjusting the weights (starting with zeros or random weights) so as to get as many training input–output pairs as possible right. However, hard thresholding entails that, for a given pair of an input pattern and a desired output pattern, small changes to the weights will often make no difference to the output. This makes it difficult to learn the weights for a multilayer network by gradient descent, where small adjustments to the weights are made to iteratively reduce the errors. If the hard threshold is replaced by a soft threshold that continuously varies, such as a sigmoid function, gradient descent can be used for learning.
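The difference between hard and soft thresholds can be made concrete. In this sketch (input, weights, and bias are arbitrary illustrative values), a small weight perturbation leaves a step unit's output unchanged, giving gradient descent nothing to work with, while a sigmoid unit's output moves smoothly:

```python
import numpy as np

def unit(x, w, b, f):
    """A model unit: weighted sum z = b + x . w passed through activation f."""
    z = b + np.dot(x, w)
    return f(z)

def step(z):
    return 1.0 if z >= 0 else 0.0      # hard threshold

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # soft threshold

x = np.array([0.5, -1.0])              # arbitrary input pattern
w = np.array([1.0, 2.0])               # arbitrary weights
b = 0.1

# A small weight change leaves the hard-threshold output unchanged
# (zero gradient), while the sigmoid output changes smoothly.
y_step_before = unit(x, w, b, step)
y_step_after = unit(x, w + 0.01, b, step)

y_sig_before = unit(x, w, b, sigmoid)
y_sig_after = unit(x, w + 0.01, b, sigmoid)

assert y_step_before == y_step_after
assert y_sig_before != y_sig_after
```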
Networks with Nonlinear Hidden Units Are Universal Function Approximators

The particular shape of the nonlinear activation function does not matter to the class of input–output mappings that can be represented. Feedforward networks with at least one layer of hidden units intervening between input and output layers are universal function approximators: Given a sufficient number of hidden units, a network can approximate any function of the inputs in the output units. Continuous functions can be approximated with arbitrary precision by adding a sufficient number of hidden units and suitably setting the weights (Schafer & Zimmermann 2007, Hornik 1991, Cybenko 1989). Figure 2c illustrates this process for two-dimensional inputs:
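A toy numerical illustration of this approximation property: a hidden layer of randomly placed sigmoid units plus a linear read-out fitted by least squares (a simplification of real training, where the hidden weights would be learned too; the target function and all sizes are arbitrary choices) already matches a smooth target closely:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target function to approximate on [-3, 3]
x = np.linspace(-3, 3, 200)[:, None]
target = np.sin(2 * x).ravel()

# Hidden layer: 100 sigmoid units with randomly placed transitions
# (for simplicity, only the output weights are fitted below)
H = 100
w_in = rng.normal(scale=3.0, size=(1, H))
b_in = rng.normal(scale=3.0, size=H)
hidden = 1.0 / (1.0 + np.exp(-(x @ w_in + b_in)))

# Linear read-out: best output weights in the least-squares sense
w_out, *_ = np.linalg.lstsq(hidden, target, rcond=None)
approx = hidden @ w_out

max_err = np.max(np.abs(approx - target))   # small worst-case error
```

Adding more hidden units drives the approximation error down further, which is the content of the universal approximation theorems cited above.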
Annu. Rev. Vis. Sci. 2015. 1:417–446. Downloaded from www.annualreviews.org. Access provided by Stanford University - Main Campus - Robert Crown Law Library on 07/26/16. For personal use only.
• A multi-layer neural network can combine discrimination subspaces
• A multi-layer perceptron is a universal function approximator — albeit not necessarily an efficient one
Deep networks as universal function approximators
[Figure 1 panels: (a) a model unit with inputs x1, x2, weights w1, w2 and bias b, computing z = b + Σi xi wi, with linear, threshold, sigmoid and rectified linear activation functions plotted; (b) shallow feedforward network (1 hidden layer); (c) deep feedforward network (>1 hidden layer); (d) recurrent network; each with input, hidden and output layers.]
Figure 1. Artificial neural networks: basic units and architectures. (a) A typical model unit (left) computes a linear combination z of its inputs xi using weights wi and adding a bias b. The output y of the unit is a function of z, known as the activation function (right). Popular activation functions include linear (gray), threshold (black), sigmoid (hyperbolic tangent shown here, blue), and rectified linear (red) functions. A network is referred to as feedforward (b,c) when its directed connections do not form cycles and as recurrent (d) when they do form cycles. A shallow feedforward network (b) has zero or one hidden layers. Nonlinear activation functions in hidden units enable a shallow feedforward network to approximate any continuous function (with the precision depending on the number of hidden units). A deep feedforward network (c) has more than one hidden layer. Recurrent nets generate ongoing dynamics, lend themselves to the processing of temporal sequences of inputs, and can approximate any dynamical system (given a sufficient number of units).
critical arguments, upcoming challenges, and the way ahead toward empirically justified models of complex biological brain information processing.
A PRIMER ON NEURAL NETWORKS
A Unit Computes a Weighted Sum of Its Inputs and Activates According to a Nonlinear Function
We refer to model neurons as units to maintain a distinction between biological reality and highly abstracted models. The perhaps simplest model unit is a linear unit, which outputs a linear combination of its inputs (Figure 1a). Such units, combined to form networks, can never transcend linear combinations of the inputs. This insight is illustrated in Figure 2b, which shows how an output unit that linearly combines intermediate-layer linear-unit activations just adds up
16

Neural networks The Hopfield network

Architecture: a set of I neurons connected by symmetric synapses of weight wij (wij = wji); no self-connections: wii = 0; output of neuron i: xi ∈ {−1, +1}

Activity rule: each neuron computes its activation ai = Σj wij xj and sets its output to xi = Θ(ai), where Θ(a) = +1 if a ≥ 0 and −1 otherwise

Synchronous/asynchronous update: either all neurons update at once, or one (e.g. randomly chosen) neuron updates at a time

Learning rule: to memorize a set of patterns {x(n)}, use the Hebb rule wij = η Σn xi(n) xj(n), e.g. with η = 1/I

alternatively, a continuous network can be defined by replacing the threshold with xi = tanh(ai)
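The architecture, Hebbian learning rule, and asynchronous activity rule above can be put together in a short sketch (the network size, number of memories, corruption level, and update count are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

I = 100                                       # number of neurons
patterns = rng.choice([-1, 1], size=(3, I))   # memories x^(n) in {-1, +1}

# Hebbian learning rule: w_ij = (1/I) sum_n x_i^(n) x_j^(n), with w_ii = 0
W = (patterns.T @ patterns) / I
np.fill_diagonal(W, 0.0)

def recall(x0, steps=2000):
    """Asynchronous activity rule: repeatedly pick a random neuron i and
    set x_i = +1 if a_i = sum_j w_ij x_j >= 0, else -1."""
    x = x0.copy()
    for _ in range(steps):
        i = rng.integers(I)
        x[i] = 1 if W[i] @ x >= 0 else -1
    return x

# Probe the network with a corrupted memory: flip 10 of the 100 bits
probe = patterns[0].copy()
probe[rng.choice(I, size=10, replace=False)] *= -1

restored = recall(probe)
overlap = np.mean(restored == patterns[0])    # fraction of bits recovered
```

With only a few stored patterns, the dynamics typically pull the corrupted probe back to the stored memory, which is the content-addressable-memory behavior the slides describe.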
17

Neural networks Stability of Hopfield network

Are the memories stable?

Necessary conditions: symmetric weights; asynchronous update

Robust against perturbation of a subset of weights
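The reason these conditions matter: with symmetric weights, no self-connections, and asynchronous updates, the network has an energy (Lyapunov) function E(x) = −½ Σij wij xi xj that an update can never increase, so the dynamics must settle into fixed points, and stored memories sit at (local) minima. A sketch verifying monotonicity numerically (sizes and step count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

I = 50
patterns = rng.choice([-1, 1], size=(2, I))
W = (patterns.T @ patterns) / I      # symmetric by construction
np.fill_diagonal(W, 0.0)             # no self-connections

def energy(x):
    # Hopfield energy: E(x) = -1/2 * sum_ij w_ij x_i x_j
    return -0.5 * x @ W @ x

x = rng.choice([-1, 1], size=I)      # arbitrary initial state
energies = [energy(x)]
for _ in range(500):
    i = rng.integers(I)              # one asynchronous update
    x[i] = 1 if W[i] @ x >= 0 else -1
    energies.append(energy(x))

# Flipping x_i to the sign of a_i changes E by -(x_i_new - x_i_old) * a_i <= 0
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(energies, energies[1:]))
```

With synchronous updates this guarantee is lost: the network can enter two-state oscillations instead of settling.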
18
Neural networks Capacity of Hopfield network
How many traces can be memorized by a network of I neurons?
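For random ±1 patterns, the classic answer is that reliable recall breaks down at roughly N ≈ 0.138·I stored patterns (Amit, Gutfreund & Sompolinsky). A rough numerical probe of this (all sizes arbitrary) is a one-step stability check: store N patterns and ask what fraction of stored bits are left unchanged by the activity rule:

```python
import numpy as np

def fraction_stable(I, N, seed=0):
    """Store N random patterns in I neurons via the Hebb rule and return the
    fraction of stored bits unchanged by one application of the activity rule
    (a rough proxy for successful memorization)."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1, 1], size=(N, I))
    W = (patterns.T @ patterns) / I
    np.fill_diagonal(W, 0.0)
    a = patterns @ W                 # activations at each stored pattern (W symmetric)
    return np.mean(np.where(a >= 0, 1, -1) == patterns)

low_load = fraction_stable(I=200, N=10)    # N/I = 0.05: essentially all bits stable
high_load = fraction_stable(I=200, N=60)   # N/I = 0.30: noticeably degraded
```

As the load N/I is pushed past the capacity limit, crosstalk between the stored patterns overwhelms the signal term and an increasing fraction of bits flips immediately, so the patterns stop being fixed points of the dynamics.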
19
Neural networks Capacity of Hopfield network