
Page 1:

Cognitive Computing 2012

The computer and the mind

4. CONNECTIONISM

Professor Mark Bishop

Page 2:

The representational theory of mind

Cognitive states are relations to mental representations which have content.

A cognitive state is a state (of mind) denoting knowledge, understanding, beliefs, etc.

Cognitive processes are mental operations on these representations.

Page 3:

Computational theories of mind

Cognitive states are computational relations to mental representations which have content.

Cognitive processes (changes in cognitive states) are computational operations on the mental representations.

Strong computational theories of mind claim that the mental representations are themselves fundamentally computational in character.

Hence the mind - thoughts, beliefs, intelligence, problem solving etc. - is ‘merely’ a computational machine.

Computational theories of mind typically come in two flavours:

The connectionist computational theory of mind, (CCTM);

The digital computational theory of mind, (DCTM).

[Figure: computations (e.g. +, −, ×, ÷) operating on a mental representation such as “Grass is green”.]

Page 4:

Basic connectionist computational theory of mind (CCTM)

The basic connectionist theory of mind is neutral on exactly what constitutes [connectionist] ‘mental representations’;

i.e. the connectionist ‘mental representations’ might not be realised ‘computationally’.

Cognitive states are computational relations to mental representations which have content.

Under the CCTM the computational architecture and (mental) representations are connectionist.

Hence for CCTM cognitive processes (changes in cognitive states) are computational operations on these connectionist mental representations.

[Figure: computations (e.g. +, −, ×, ÷) operating on a connectionist mental representation such as ‘Happiness’.]

Page 5:

A ‘non-computational’ connectionist theory of mind

Conceptually it is also possible to formulate a connectionist non-computational theory of mind where:

Cognitive states are relations to mental representations which have content.

But the mental representations might not be ‘computational’ in character; perhaps they are instantiated on a non-computational connectionist architecture

AND / OR

the relation between cognitive state and mental representation is non-computational; or the relationship between one cognitive state and the next is non-computational.

The term ‘non-computational’ here typically refers to a mode of [information] processing that, in principle, cannot be carried out by a Turing Machine.

Page 6:

The connectionist computational theory of mind

A form of ‘Strong AI’ which holds that a suitably programmed computer ‘really is a mind’ (it has thoughts, beliefs, intelligence etc.):

Cognitive states are computational relations to fundamentally computational mental representations which have content defined by their core computational structure.

Cognitive processes (changes in cognitive states) are computational operations on these computational mental representations.

The computational architecture and representations are computationally connectionist.

Page 7:

Artificial neural networks

What is Neural Computing / Connectionism?

It defines a mode of computing that seeks to emulate the style of computing used within the brain.

It is a style of computing based on learning from experience as opposed to classical, tightly specified, algorithmic, methods.

A Definition:

“Neural computing is the study of networks of adaptable nodes which, through a process of learning from task examples, store experiential knowledge and make it available for use.”

Page 8:

The link between connectionism and associationism

By considering that:

the input nodes of an artificial neural network represent data from sensory transducers (the 'sensations');

the internal (hidden) network nodes encode ideas;

the inter-node weights indicate the strength of association between ideas;

the output nodes define behaviour;

… then we see a correspondence between connectionism and associationism.

Page 9:

The neuron: the adaptive node of the brain

Within the brain neurons are often organized into complex regular structures. Input to neurons occurs at points called synapses located on the cell’s dendritic tree.

Synapses are either excitatory, where activity aids the overall firing of the neuron, or inhibitory where activity inhibits the firing of the neuron.

The neuron effectively takes all firing signals into account by summing the synaptic effects and firing if this is greater than a firing threshold, T.

The cell’s output is transmitted along a fibre called the axon. A neuron is said to fire when the axon transmits bursts of pulses at around 100Hz.

Page 10:

The McCulloch/Pitts cell

In the MCP model adaptability comes from representing each synaptic junction by a variable weight Wi, indicating the degree to which the neuron should react to this particular input.

By convention positive weights represent excitatory synapses and negative weights inhibitory synapses.

The neuron firing threshold is represented by a variable T. In modern MCP cells T is usually clamped to zero and a threshold implemented using a variable bias, b. A bias is simply a weight connected to an input clamped to [+1].

In the MCP model the firing of the neuron is represented by the number 1, and no firing by 0.

Equivalent to a proposition being TRUE or FALSE: “Thus in Psychology, .. , the fundamental relations are those of two valued logic”, MCP (1943).

Activity at the ith input to the neuron is represented by the symbol Xi and the effect of the ith synapse by a weight Wi.

Net input at the ith synapse of the MCP cell is: Xi × Wi

The MCP cell will fire if: ( Σi (Xi × Wi) + b ) ≥ 0
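A minimal sketch of this firing rule in Python (the weight and bias values below are illustrative, chosen by hand to compute logical OR, and are not taken from the slides):

```python
# A minimal MCP (McCulloch-Pitts) cell with a bias input, following the
# firing rule above: fire (output 1) if sum(Xi * Wi) + b >= 0, else 0.

def mcp_fire(x, w, b):
    """Return 1 if the weighted sum of inputs plus bias is >= 0, else 0."""
    net = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if net >= 0 else 0

# Example: weights chosen by hand so the cell computes logical OR.
w_or = [1.0, 1.0]
b_or = -0.5                      # fires whenever at least one input is 1
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcp_fire(x, w_or, b_or))   # prints 0, 1, 1, 1
```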

Page 11:

So, what type of tasks can neural networks do?

From McCulloch & Pitts (1943), a network of MCP cells can “compute only such numbers as can a Turing Machine; second, that each of the latter numbers can be computed by such a net”.

A neural network classifier (figure above) maps an arbitrary input vector to an (arbitrary) output class.

Page 12:

Vector association

An associative neural network is one that maps (associates) a given input vector to a particular output vector.

Associative networks in ‘prediction’:

e.g. given the input vector [age, alcohol consumed], map to the output vector [the subject’s response time].

Page 13:

What is a learning rule?

To enable a neural network to either associate or classify correctly we need to correctly specify all its weights and thresholds.

In a typical network there may be many thousands of weight and threshold values.

A neural network learning rule is a procedure for automatically calculating these values; typically there are far too many to calculate by hand.

Page 14:

Hebbian learning

“When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”

... from Hebb, D., (1949), The Organisation of Behaviour.

ie. When two neurons are simultaneously excited then the strength of the connection between them should be increased.

"The change in weight connecting input Ii and output Oj is proportional (via a learning rate tau, ) to the product of their simultaneous activations."

Wij = Ii Oj
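A minimal sketch of this Hebbian update in Python; the learning rate and activation values are illustrative, not from the lecture:

```python
# Hebbian weight update: delta_W[i][j] = tau * I[i] * O[j].

tau = 0.1                        # learning rate (tau), illustrative value

def hebbian_update(weights, inputs, outputs, tau):
    """Increase each weight in proportion to the simultaneous activation
    of the input and output it connects."""
    for i, ii in enumerate(inputs):
        for j, oj in enumerate(outputs):
            weights[i][j] += tau * ii * oj
    return weights

# One input pattern (two inputs) driving two outputs:
W = [[0.0, 0.0], [0.0, 0.0]]
W = hebbian_update(W, inputs=[1, 0], outputs=[1, 1], tau=tau)
print(W)   # only the weights from the active input are strengthened
```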

Page 15:

Training sets

The function that the neural network is to learn is defined by its ‘training set’.

For example, to learn the logical OR function the training set would consist of four input-output vector pairs defined as follows.

The OR Function

Pat   I/P1   I/P2   O/P
1.     0      0      0
2.     0      1      1
3.     1      0      1
4.     1      1      1
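In code, a training set like this can be held simply as a list of (input vector, target output) pairs; this representation is just a convenient assumption for the later examples, not part of the lecture:

```python
# The OR training set above, as (input_vector, target_output) pairs.
or_training_set = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 1),
]
```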

Page 16:

Rosenblatt’s perceptron

When Rosenblatt first published information on the ‘Perceptron Convergence Procedure’ in 1959, it was seen as a great advance on the work of Hebb.

The full (‘classical’) perceptron model can be divided into three layers (see opposite):

Page 17:

Perceptron structure

The First Layer (Sensory or S-Units): The first layer, the retina, comprises a regular array of S-Units.

The Second Layer (Association or A-Units): The input to each A-Unit is the weighted sum of the output of a randomly selected set of S-Units. These weights do not change. Thus A-Units respond only to particular patterns, extracting specific localized features from the input.

The Third Layer (Response or R-Units): Each R-Unit has a set of variable weighted connections to a set of A-Units. An R-Unit outputs +1 if the sum of its weighted input is greater than a threshold T, -1 otherwise. In some perceptron models, an active R-Unit will inhibit all A-Units not in its input set.

Page 18:

The ‘perceptron convergence procedure’

If the perceptron response is correct, then no change is made in the weights to R-Units.

If the response of an R-Unit is incorrect then it is necessary to:

Decrement all active weights if the R-Unit fires when it is not meant to and increase the threshold.

Or conversely increment active weights and decrement the threshold, if the R-Unit is not firing when it should.

The Perceptron Convergence Theorem (Rosenblatt) states that the above procedure is guaranteed to find a set of weights to perform a specified mapping on a single layer network, if such a set of weights exists!
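A sketch of one error-correction step of this procedure, under the R-Unit convention from the earlier slide (output +1 if the weighted sum exceeds the threshold T, -1 otherwise); the function names and learning rate are illustrative:

```python
# One step of the perceptron convergence procedure for a single R-Unit.

def r_unit_output(x, w, T):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) > T else -1

def convergence_step(x, w, T, target, lr=1.0):
    """Return updated (weights, threshold) after presenting one pattern."""
    out = r_unit_output(x, w, T)
    if out == target:
        return w, T                                    # correct: no change
    if out == 1 and target == -1:
        # fired when it should not have: decrement active weights, raise T
        w = [wi - lr * xi for wi, xi in zip(w, x)]
        T += lr
    else:
        # failed to fire when it should have: increment active weights, lower T
        w = [wi + lr * xi for wi, xi in zip(w, x)]
        T -= lr
    return w, T
```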

Page 19:

The ‘order’ of a perceptron

The order of a perceptron is defined as the largest number of inputs to any of its A-Units. Perceptrons will only be useful if this ‘order’ remains constant as the size of the retina is increased.

Consider a simple problem - the perceptron should fire if there is one or more groups of [2*2] black pixels on the input retina.

Opposite: a [4x4] blob-detecting perceptron.

This problem requires that the perceptron has as many A-Units as there are pixels on the retina, less duplications due to edge effects. Each A-Unit covers a [2*2] square and computes the AND of its inputs.

If all the weights to the R-Unit are unity and the threshold is just lower than unity, then the perceptron will fire if there is a black square anywhere on the retina.

The order of the problem is thus four, O(4). This order remains constant irrespective of the size of the retina.
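A sketch of this blob detector on a small binary retina; the retina size and test pattern are illustrative:

```python
# [2x2] blob detector: each A-Unit ANDs one 2x2 patch (so the order is 4);
# the R-Unit fires if the sum of A-Unit outputs, with unit weights, exceeds
# a threshold set just below 1.

def a_units(retina):
    """One A-Unit per 2x2 patch: output 1 iff all four pixels are black (1)."""
    rows, cols = len(retina), len(retina[0])
    outs = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            patch = (retina[r][c], retina[r][c + 1],
                     retina[r + 1][c], retina[r + 1][c + 1])
            outs.append(1 if all(patch) else 0)
    return outs

def blob_perceptron(retina, threshold=0.9):
    return 1 if sum(a_units(retina)) > threshold else 0

retina = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
print(blob_perceptron(retina))   # 1: a 2x2 black blob is present
```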

Page 20:

The delta rule: a modern formulation of the perceptron learning procedure

The modern formulation of the perceptron learning rule, for changing the weights in a single layer network of MCP cells following the presentation of input/output training pair p, is:

Δp Wij = η (Tpj − Opj) Ipi = η δpj Ipi

η is called the learning rate (eta). (Tpj − Opj) is the error (or delta) term, δpj, for the jth neuron.

Ipi is the ith element of the input vector, Ip.
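A sketch of the delta rule training a single MCP cell on the OR training set from the earlier slide; the learning rate and number of passes are illustrative, and the bias is treated as a weight on a constant +1 input, as described on the MCP slide:

```python
# Delta rule: delta_W[i] = eta * (target - output) * input[i].

eta = 0.25
training_set = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

w = [0.0, 0.0]
b = 0.0

def output(x, w, b):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b >= 0 else 0

for epoch in range(20):                      # a few passes are plenty for OR
    for x, target in training_set:
        delta = target - output(x, w, b)     # error (delta) term
        w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
        b += eta * delta                     # bias input is clamped to +1

print(w, b)
print([output(x, w, b) for x, _ in training_set])   # expect [0, 1, 1, 1]
```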

Page 21:

Two input MCP cell

The output function can be represented in two dimensions, using the x-axis for one input and the y-axis for the other.

Examining the MCP equation for two inputs: X1 W1 + X2 W2 > T

The MCP output function can be represented by a line dividing the two dimensional input space into two areas.

The above equation can be re-written as the equation of the line dividing the input space into two classes:

X1 W1 + X2 W2 = T, OR X2 = T/W2 − X1 (W1/W2)
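A small worked example of this dividing line, assuming the illustrative values W1 = W2 = 1 and T = 1.5 (an AND-like cell):

```python
# The dividing line X2 = T/W2 - X1*(W1/W2) for illustrative parameters.

W1, W2, T = 1.0, 1.0, 1.5

def boundary_x2(x1):
    """The X2 value on the dividing line for a given X1."""
    return T / W2 - x1 * (W1 / W2)

for x1 in (0.0, 0.5, 1.0):
    print(x1, boundary_x2(x1))   # the line X1 + X2 = 1.5; only (1,1) lies above it
```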

Page 22:

Linearly separable problems

The two input MCP cell can correctly classify any function that can be separated by a straight dividing line in input space.

This class of problems is defined as the ‘Linearly Separable’ problems, e.g. the OR/AND functions.

The MCP threshold parameter simply translates (shifts) the line dividing the two classes.

Page 23:

Linearly inseparable problems

There are many problems that cannot be linearly divided in input space

Minsky and Papert termed these ‘Hard Problems’.

The most famous example of this class of problem is the ‘XOR’ problem.

The two input XOR problem is not linearly separable in two dimensions

See figure opposite.

Page 24:

To make a problem linearly separable

To solve the two input XOR problem it needs to be made linearly separable in input space.

Hence an extra input (dimension) is required.

Consider an XOR function defined by three inputs (a,b,c), where (c = a AND b)

Thus embedding the 2 input XOR in a 3 dimensional input space.

In general a two class, k-input problem can be embedded in a higher n-dimensional hypercube (n > k).

A two class problem is linearly separable in n dimensions if there exists a hyper-plane to separate the classes.

cf. the ‘Support Vector Machine’: here we map from an input space (where the data are not linearly separable) to a sufficiently large feature space, where the classes are linearly separable.
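A sketch verifying that the two-input XOR becomes linearly separable once the extra input c = a AND b is added, as described above; the weights and bias are one hand-picked illustrative solution, not from the lecture:

```python
# XOR over inputs (a, b, c) with c = a AND b, computed by a single MCP cell.

def mcp_fire(x, w, b):
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b >= 0 else 0

w = [1.0, 1.0, -2.0]   # weights for inputs (a, b, c)
b = -1.0               # bias: fire when a + b - 2c >= 1

for a in (0, 1):
    for bb in (0, 1):
        c = a and bb                      # the extra input c = a AND b
        out = mcp_fire([a, bb, c], w, b)
        print((a, bb), '->', out)         # reproduces XOR: 0, 1, 1, 0
```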

Page 25:

Hard problems

In their book ‘Perceptrons’, Minsky & Papert showed that there were several simple image processing tasks that could not be performed by Single Layer Perceptrons (SLPs) of fixed order.

All these problems are easy to compute using ‘conventional’ algorithmic methods.

Page 26:

Connectedness

A Diameter Limited Perceptron is one where the inputs to an A-Unit must fall within a receptive field of size D.

Of the four example figures (a)-(d), clearly only (b) and (c) are connected, hence the perceptron should fire only on (b) and (c).

The A-Units can be divided into three groups. Those on the left, the middle and the right of the image.

Clearly for images (a) & (c) it is only the left group that can tell the difference, hence there must be higher weights activated by the left A-Units in image (c) than image (a).

Clearly for images (a) & (b) it is only the right group that can tell the difference, hence there must be higher weights activated by the right A-Units on (b) than on (a).

However the above two requirements give (d) higher activation than (b) and (c), which implies that if a threshold is found that can classify (b) & (c) as connected, then it will incorrectly classify (d)!

Page 27:

Multi-layer Perceptrons

Solutions to Minsky & Papert’s hard problems arose with the development of learning rules for multi-layer perceptrons.

The most famous of these is called ‘Back [error] Propagation’. It was initially developed by the control engineer Paul Werbos and published in the appendix to his PhD thesis in 1974, but was ignored for many years. Paul J. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, PhD thesis, Harvard University, 1974.

Back propagation was independently rediscovered by Le Cun and published (in French) in 1985. Y. LeCun, Une procédure d'apprentissage pour réseau à seuil asymétrique (a Learning Scheme for Asymmetric Threshold Networks), Proceedings of Cognitiva 85, 599-604, Paris, France, 1985.

However the rule gained international renown with the publication of Rumelhart & McClelland’s ‘Parallel Distributed Processing’ texts in 1986, and they are the authors most strongly associated with it. Rumelhart, D.E., J.L. McClelland and the PDP Research Group (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, Cambridge, MA: MIT Press.
