self-organised learning in the chialvo-bak model



    Self-Organised Learning in the Chialvo-Bak Model

    MSc Project

    Marco Brigham


    Master of Science

    Artificial Intelligence

    School of Informatics

    University of Edinburgh

    2009


    Abstract

A review of the Chialvo-Bak model is presented for the two-layer neural network topology. A novel Markov chain representation is proposed that yields several important analytical quantities and supports a learning convergence argument. The power law regime is re-examined under this new representation and is found to be limited to learning under small mapping changes. A parallel between the power law regime and biological neural avalanches is proposed. A mechanism to avoid the permanent tagging of synaptic weights under the selective punishment rule is proposed.


    Acknowledgements

    I wish to thank Dr. Mark van Rossum for his tireless support and attentive guidance, and for

    having accepted to supervise me in the first place.

    To Dr. J. Michael Herrmann I wish to thank the very creative and rewarding discussions on the

    holistic merits of the Chialvo-Bak model.

    To Dr. Wolfgang Maass and his team at the Institute for Theoretical Computer Science at

    T.U. Graz, I wish to thank the precious feedback and fruitful discussions received after the first

    talk on this MSc project.


    Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

    (Marco Brigham)


    To the memory of Per Bak, whose ideas live on and inspire.


    Contents

1 Introduction
  1.1 Brief literature review

2 The Two-Layer Topology
  2.1 Basic Principles and Learning
    2.1.1 Interference events
    2.1.2 Synaptic Landscape
    2.1.3 Neural avalanches
    2.1.4 Summary
  2.2 Storing Mappings
    2.2.1 Summary
    2.2.2 Appendix
  2.3 Advanced Learning
    2.3.1 Summary
    2.3.2 Appendix

3 Research Results
  3.1 δ-band Saturation
    3.1.1 Desaturation strategies
    3.1.2 Global tag threshold
    3.1.3 Summary
  3.2 Markov Chain Representation
    3.2.1 Statistical properties
    3.2.2 Markov chain representation: numerical evidence
    3.2.3 Analytical solution for (2, n_m, n_o)
    3.2.4 Alternate formulation: graph transitions
    3.2.5 Analytical solution: numerical evidence
    3.2.6 Summary
    3.2.7 Appendix
  3.3 Learning Convergence
    3.3.1 Summary
  3.4 Power-Law Behaviour and Neural Avalanches
    3.4.1 Biological interpretation
    3.4.2 Summary

4 Conclusion


    Chapter 1

    Introduction

The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999, with the stated goal of identifying some universal and simple mechanism which allows a large number of neurons to connect autonomously in a way that helps the organism to survive [8]. Their effort resulted in a schematic brain model of self-organised learning and adaptation that operates using the principle of satisficing [1].

In common with other models authored by P. Bak is a patent minimalism of form, where models are succinctly defined by simple, local and stochastic interaction rules that reflect the most basic assumptions of the real-world system. However simple and minimalistic, these models manage to reproduce complex and emergent behaviour that is observed in the real-world systems [2].

In the Chialvo-Bak model, the basic properties of neurons and neural networks are represented by simple, local and stochastic dynamical rules that support the processes of learning, memory and adaptation. The most basic operations in the model, the node activation and the synaptic plasticity rules, are regulated by Winner-Take-All (WTA) dynamics and learning by synaptic depression, respectively. These mechanisms may correspond to well accepted physiological mechanisms [14] [10] [12], which suggests the biological feasibility of the model.

The present work focused on extending the analytical understanding of the model. A Markov chain representation for the simple two-layer topology is proposed, where the states of the chain correspond to the learning states of the network. This representation provides a good statistical description of the model and supports an argument for the learning convergence.

A power law tail in the learning time distribution, corresponding to an order-disorder phase transition in the model, was proposed by J.R. Wakeling [20]. This result was specific to the slow change mode, where the network is made to learn a succession of small mapping changes. The power law behaviour was re-examined under the Markov chain representation for other mapping change modes and was only reproduced in the slow change mode.

An argument is provided for drawing a parallel between the above power law behaviour and the biological neural avalanches, evidenced experimentally by J. Beggs and D. Plenz [4][5] in 2003. These correspond to the propagation of spontaneous activity in neural networks with power law behaviour in the event size distribution.

The ability to store previously successful configurations is enabled by a selective punishment mechanism [8] [1], where successful synaptic weights are depressed less severely when no longer leading to the correct mappings. The selective punishment mechanism has a known ageing effect [1] that is related to the permanent tagging of the successful synaptic weights. A mechanism to avoid the permanent tagging, in order to maintain the performance advantage of selective punishment, is proposed.

    This document is organised as follows:

Chapter 1 presents a succinct introduction to the Chialvo-Bak model and the relevant literature.

Chapter 2 introduces the Chialvo-Bak model in the simple two-layer topology, covering the learning modes, the selective punishment rule and the power-law tail behaviour.

    Chapter 3 presents the research results of this MSc project.

    Chapter 4 presents the conclusion and future work.

    1.1 Brief literature review

A brief review of the research papers related to the Chialvo-Bak model is presented below. The purpose of this review is to broadly describe the areas of the model that have already been investigated to considerable depth. Detailed descriptions of the model that support the present work are provided in Section 2.1.

The literature on the Chialvo-Bak model can be grouped into papers that follow the original formulation of the model and papers that extend the model to different working principles and dynamic rules. As the present work is closely aligned with the first approach, so is the focus of the literature review presented below.

    Papers on the original Chialvo-Bak model

The Chialvo-Bak model was introduced by P. Bak and D. Chialvo [8] in 1999. In this paper, the motivations, biological constraints and ground rules are put forward, and a great emphasis is placed on the biological plausibility of the model, which leads to requirements of self-organisation at different levels and robustness to noise.

Self-organised learning and adaptation is required to reflect the ability to learn without external guidance. The apparent lack of information in the DNA to encode the physical properties of neurons and synapses and their connectivity [15] motivates the self-organisation at the connectivity level. Each neuron must learn, without genetic or external guidance, which other neurons to connect to, and this connectivity should remain flexible in order to adapt to external changes.

The ability to quickly recover from perturbations induced by biological noise is a constraint motivated by the biological reality of the organism.

    3

  • 7/29/2019 Self-Organised Learning in the Chialvo-Bak Model

    9/55

Learning by synaptic depression (negative feedback) is proposed as the basis of biological learning and adaptation, supported by the following elements:

Long-term synaptic depression (LTD) is as common in the mammalian brain as long-term synaptic potentiation (LTP) [8]. The LTD mechanism is the suggested physiological implementation of learning by synaptic depression.

Learning by synaptic potentiation leads to very stable synaptic landscapes, from which adaptation to new configurations is difficult and slow.

Learning new tasks or adapting to new environments is error prone; as such, a process that acts on errors rather than on what is correct leads to faster learning.

The other pillar of the model, the Winner-Take-All (WTA) rule, is inspired by models of Self-Organised Criticality [2], as a means to drive the system to an adaptive critical state, where small perturbations can cause very large changes in the synaptic landscape. The WTA rule also plays a key role in the solution to the credit assignment problem by keeping the activity of the network low, as detailed in Section 2.1.

The synaptic plasticity changes are driven by a global signal informing on the success of the latest synaptic changes. The ability of the organism to differentiate between outcomes is deemed innate to the system and possibly the result of Darwinian selection.

A second paper by the same authors [1] was published in 2000 that expanded on several key aspects of the model, such as the network topologies, the memory mechanism and a new learning rule to tackle more complex problems. The performance scaling under the new learning rule was also analysed.

Several network topologies and their relevant learning rules were formally defined and these are illustrated in Figure 1.1.

(a) The simple layered network topology, which is the one used for the present work.

(b) The lattice network topology, where nodes connect to a small number of nodes in the subsequent layer.

(c) The random network topology, where nodes connect randomly to a number of other neurons. Two subsets of nodes are selected as the input and the output nodes of the network.

Figure 1.1 The various network topologies proposed in the original model. Figures reproduced from [1].


The ability to store and retrieve previously successful configurations is enabled by the selective punishment of synaptic weights, where previously successful weights are depressed less severely when no longer leading to the correct mappings.

A small modification to the basic learning rules enables the network to learn non-linear problems such as the XOR problem or, more generally, the generalised parity problem, where the parity of any number of input neurons must be correctly calculated.

The learning of multistep sequences is introduced, where the depression of weights related to bad sequences only occurs at the end of the last step. The generalisation and feature detection capability is also covered: the network is able to differentiate between classes of inputs requiring the same output by identifying useful features in the inputs.

In [20], J.R. Wakeling identified an order-disorder phase transition in the model that is regulated by the ratio of middle layer to input and output nodes. At the phase transition, the network displays power-law behaviour with exponent 1.3 in the learning time distribution.

These order-disorder regimes are characterised by the frequency of path interference events, where already learnt mappings are accidentally destroyed while learning new mappings. The disordered phase is characterised by a high probability of interference.

In [21], J. Wakeling investigated the performance of synaptic potentiation and selective punishment under different mapping change modes. In the slow change mode the network is made to learn a succession of mapping sets that only differ by one input-output mapping. In the flip-flop mode two different mapping sets are presented alternately to the network.

Synaptic potentiation was introduced into the plasticity rules by rewarding successful weights by a small amount while still punishing unsuccessful weights. A quantitative analysis showed that any small amount of synaptic potentiation resulted in higher average learning time, especially in the slow change mode. This is illustrated in Figure 1.2a.

The performance of selective punishment was also investigated. While no visible improvement was detected in the slow change mode, in the flip-flop mode the mechanism was very effective, as illustrated in Figure 1.2b.

    Chialvo-Bak model extensions

In [7], R.J.C. Bosman, W.A. van Leeuwen and B. Wemmenhove propose an extension to the Chialvo-Bak model by including the potentiation of successful weights. This enables faster single-mapping learning and multi-node firing in each layer, but comes at the cost of adaptation performance.

In [16], K. Klemm, S. Bornholdt and H.G. Schuster propose a stochastic approximation to the WTA rule and include a forgiveness parameter in order to only punish those weights that are consistently unsuccessful. This extension results in a more complex model that, according to the learning performance analysis in [1], does not improve on the original model.

(a) The learning performance of synaptic potentiation in the slow change mode, where only one mapping is changed at a time.

(b) The learning performance of synaptic potentiation in the flip-flop mode, where the network is presented alternately with two different mappings.

Figure 1.2 The learning performance decreases with any amount of synaptic potentiation for the successful weights. Reproduced from [21].


    Chapter 2

    The Two-Layer Topology

This chapter describes the functioning and the properties of the Chialvo-Bak model that are most relevant to the research results presented in Chapter 3. As such, it does not comprise an extensive description of the model, for which the reader is best directed to the original papers by P. Bak and D. Chialvo [8] [1].

The contents of this chapter are based on the above two papers and on the paper published by J.R. Wakeling [20] on the order-disorder phase transition of the model.

    2.1 Basic Principles and Learning

    The Chialvo-Bak model is characterised by the following principles:

Winner-Take-All dynamics: Neural activity only propagates through the strongest synapses.

Learning by synaptic depression (negative feedback): Synaptic plasticity is exclusively applied by the weakening of synaptic weights that participate in wrong decisions. These synapses are depressed.

These principles define the node activation and plasticity rules of the network, therefore determining the dynamics and properties of the model. To illustrate this, the functioning of a Chialvo-Bak network while learning an arbitrary input-output mapping is presented below.

Consider a neural network with one input layer, one middle layer and one output layer, as illustrated in Figure 2.1. The layers have n_i, n_m and n_o nodes respectively and the network is noted (n_i, n_m, n_o).

The nodes connect between layers with synaptic weights w, as follows:

Input nodes i connect to all middle nodes m with weights w(m, i).

Middle nodes m connect to all output nodes o with weights w(o, m).

The network is initialised with random weights in [0, 1].

Each node can be active or not active, corresponding to node state 1 or 0. The activation of an input layer node results in the activation of one middle layer node and one output layer node, according to the following Winner-Take-All (WTA) rule:


Figure 2.1 The two-layer network with three input nodes, four middle layer nodes and three output nodes, i.e. n_i = 3, n_m = 4 and n_o = 3. Each input node is connected to all nodes in the middle layer, and each middle layer node is connected to all nodes in the output layer.

Input node i activates the middle layer node m with maximum w(m, i).

Middle node m activates the output node o with maximum w(o, m).

In biological terms, the WTA rule could be implemented using lateral inhibitory connections within the same layer and excitatory connections between layers.

The activation sequence above ensures that no directed cycles are possible between nodes, qualifying the network as a feed-forward network.

For a given set of weights, the WTA rule determines the sequence of activation in the middle and output layers, which defines the active configuration of the network. An active configuration example is shown in Figure 2.2, where the blue connections represent the winning weights according to the WTA rule.

Figure 2.2 The Winner-Take-All (WTA) rule specifies the active configuration of the network. In the above graph all input nodes are shown active, whereas in the network only one input node is active at a time. In this example the active configuration is {{1, 1, 1}, {2, 2, 3}, {3, 4, 2}} and corresponds to the input-output mapping set {1, 3, 2}. The blue connections represent the active weights of the configuration.

An active configuration associates input nodes to middle layer nodes, which in turn are associated to output nodes. As such, each active configuration maps the input nodes to output nodes: to each input node i corresponds a mapping to the output node M(i).


The mapping set M contains the mappings of all the input nodes of the network.

In such terms, learning an arbitrary mapping set M corresponds to the evolution of the synaptic weights from an initial active configuration to a final active configuration that yields the required mapping set M.

The network learns an arbitrary mapping set M by applying the following synaptic plasticity rules:

1. A random input node i is selected.

2. The input node i fires and activates a middle layer node m and an output layer node o according to the WTA rule.

3. If output node o is correct, i.e. o = M(i), return to step 1.

4. Otherwise depress the active weights w(m, i) and w(o, m) by a random amount in [0, 1] and return to step 2.

A sequence of learning steps from the mapping set {1, 3, 2} to the identity mapping set {1, 2, 3} is illustrated in Figure 2.3.

Figure 2.3 The learning of the identity mapping set {1, 2, 3} from the initial mapping set {1, 3, 2}. In this example, the identity mapping set is learnt in three depressions: one depression to learn input node 2 (upper row graph sequence) and two depressions to learn input node 3 (lower row graph sequence). The blue connections represent the active weights of the configuration and the orange connections represent the depression of active weights.

A weight normalisation can be applied at the end of step 3 of the plasticity rules, by raising the weights of input node i and middle layer node m such that the winning weights are equal to one.
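To make the above rules concrete, the following sketch implements the two-layer network, the WTA activation and the depression plasticity. It is an illustrative implementation written for this review, not code from the original papers: the names ChialvoBakNet, w_mi and w_om are mine, outputs are indexed from zero, and the sketch assumes n_m ≥ n_i so that learning terminates.

```python
import numpy as np

class ChialvoBakNet:
    """Minimal sketch of the two-layer Chialvo-Bak network (illustrative only)."""

    def __init__(self, n_i, n_m, n_o, rng=None):
        self.n_i, self.n_m, self.n_o = n_i, n_m, n_o
        self.rng = rng if rng is not None else np.random.default_rng()
        # w_mi[m, i]: input i -> middle m,  w_om[o, m]: middle m -> output o
        self.w_mi = self.rng.uniform(0.0, 1.0, (n_m, n_i))
        self.w_om = self.rng.uniform(0.0, 1.0, (n_o, n_m))

    def fire(self, i):
        """Winner-Take-All rule: the strongest synapse wins in each layer."""
        m = int(np.argmax(self.w_mi[:, i]))
        o = int(np.argmax(self.w_om[:, m]))
        return m, o

    def mapping(self):
        """Input-output mapping set realised by the current active configuration."""
        return [self.fire(i)[1] for i in range(self.n_i)]

    def learn(self, target):
        """Learn the mapping set `target` (one output index per input node);
        returns the learning time, i.e. the number of depressions used."""
        depressions = 0
        while self.mapping() != list(target):
            i = int(self.rng.integers(self.n_i))   # step 1: select a random input node
            m, o = self.fire(i)                    # step 2: WTA activation
            while o != target[i]:                  # step 4: depress and fire again
                self.w_mi[m, i] -= self.rng.uniform(0.0, 1.0)
                self.w_om[o, m] -= self.rng.uniform(0.0, 1.0)
                depressions += 1
                m, o = self.fire(i)
            # step 3: correct output; optional normalisation raises the weights of
            # input i and middle node m so that the winning weights equal one
            self.w_mi[:, i] += 1.0 - self.w_mi[m, i]
            self.w_om[:, m] += 1.0 - self.w_om[o, m]
        return depressions


net = ChialvoBakNet(6, 36, 6)
print(net.learn([0, 1, 2, 3, 4, 5]))   # depressions needed for one mapping set
```

Interference events (Section 2.1.1) can be read off the same sketch by comparing the number of correctly mapped inputs before and after each depression.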


Step 2 of the plasticity rules requires a feedback signal informing on the suitability of the recent changes. This is provided in the form of a global feedback signal that is broadcast to the entire network in the case the latest changes are not satisfactory.

The synaptic plasticity rules could correspond to the following events at the biological level:

1. Depressing the current active level of an input node results in a new active level that is tagged for recent changes. This tagging takes the form of a chemical or hormone release that is triggered by the latest synaptic activity.

2. No further plasticity changes take place until a global feedback signal is received. This signal is broadcast to the entire network, informing whether the latest changes are unsatisfactory.

3. Following an unsuccessful global feedback signal, step 1 is repeated. No further actions are taken otherwise.

    The synaptic plasticity rules result in the following properties:

For the global feedback mechanism to be efficient in directing synaptic learning, the rate of plasticity change has to be sparse. In such conditions the credit assignment problem [18][3] is solved, i.e. the system can determine which elements are to be punished following bad performance.

The network signalling is on the time scale of firing patterns (i.e. milliseconds), while the tagging and feedback mechanisms are on a timescale more adapted to the scale of events in the external world (i.e. seconds to hours).

The global feedback signal represents an external critic rather than a teacher, as no specific instructions are provided to direct the plasticity activity.

For the network to learn a random mapping set M, the middle layer size must be at least as large as the input layer size, i.e. n_m ≥ n_i, so that each input node can have a dedicated path to the corresponding output node M(i).

A network with n_i input nodes, n_m middle layer nodes and n_o output nodes is noted (n_i, n_m, n_o).

    2.1.1 Interference events

In the process of learning a new mapping an interference event may occur, where the network unlearns a previously learnt mapping.

This is the case whenever, while learning a new mapping, a middle layer node that was establishing a correct mapping for another input node is selected (assuming the correct output node for these input nodes is different). This is illustrated in Figure 2.4.

    2.1.2 Synaptic Landscape

An interesting consequence of learning by synaptic depression is the resulting synaptic landscape, shown in Figure 2.5.


Figure 2.4 While learning a mapping the network may unlearn a previously correct mapping. In the above sequence, the learning of input node 3 led from the mapping set {1, 2, 2} to {1, 1, 3}. As such, the net number of learnt mappings remained unchanged: the output mapping of input node 3 was learnt and the output mapping of input node 2 was unlearnt.

(a) The synaptic weights from input node one to the middle layer, in a network (6, 108, 6).

(b) The synaptic weight distribution (100 bins) in a network (32, 1024, 32).

Figure 2.5 The metastable synaptic landscape is a direct consequence of learning by synaptic depression and supports the fast adaptation property of the model.

In Figure 2.5a the metastable nature of the synaptic landscape is apparent, with the active configuration barely supporting the current mapping. This is to be contrasted with the synaptic landscape resulting from learning by synaptic potentiation, which often results in a small number of dominating synaptic weights.

In this model, learning a very different mapping set is often just a few depressions away from the currently active weights.

The particular form of the weight distribution in Figure 2.5b is due to both the WTA rule and learning by synaptic depression. As the active synapses for a given input or middle layer node are depressed by a random amount, the WTA rule will select the synapse with the current highest weight1 for the new active configuration in each layer. This amounts to shifting all the weights by the difference between the previous highest weight and the new highest weight.

1The highest weight after depression may still be the previous winning weight, but the probability of re-selection is lower than for any other weight.


Starting from a uniform weight distribution and repeating the above process a sufficient number of times yields the distribution in Figure 2.5b. The intermediate steps of this process are illustrated in Figure 2.6.

(a) The synaptic weight distribution after adapting to eight successive random mappings, in a network (32, 1024, 32).

(b) The synaptic weight distribution after adapting to 15 random mappings, in a network (32, 1024, 32).

Figure 2.6 The synaptic weight distribution evolves from a uniform distribution at the initialisation of the network to the distribution in Figure 2.5b.

    2.1.3 Neural avalanches

The learning performance of the network can be measured by the number of depressions required to completely learn a given mapping set M. This quantity will be loosely referred to as the learning time, although no particular timescale is thereby implied.

The learning performance is known [8] to improve with increasing middle layer sizes, as illustrated in Figure 2.7. This is an advantage over regular back-propagation learning, where in general the performance decreases with increasing middle layer size.

Let T be the random variable associated with the number of depressions required to learn a mapping set and let p(τ) ≡ P(T = τ) be the probability of learning a mapping set in τ depressions.

The learning performance of the network (n_i, n_m, n_o) is completely determined by the learning time distribution, characterised by the probability mass function p(τ) such that

$$ p(\tau) = P(T = \tau), \qquad (2.1) $$

$$ \sum_{\tau = 0}^{\infty} p(\tau) = 1. \qquad (2.2) $$

The basic operation for measuring p(τ) is to record the number of depressions required to learn the current mapping set, increase by one unit the count of mapping sets learnt in that number of depressions, present a new mapping set to the network, and so on. However, certain aspects of the setup of the simulations have a noticeable impact on the measured values, and these will be discussed in greater detail below.


(a) The average number of depressions required to learn a random mapping set, for different sizes of the middle layer.

(b) The average number of interference events while learning a random mapping set, for different sizes of the middle layer.

Figure 2.7 Increasing the number of middle layer nodes decreases the average learning time and the average number of interference events. The plots show these averages for networks with six input nodes, seven output nodes and a varying number of middle layer nodes.

One could require the weights of the network to be reset prior to learning the next mapping set, but this would lead to measuring first-mapping learning times. Instead, the weights of the network are not reset at each new mapping set, and this yields a measure that is closer to on-line learning performance.

A further distinction can be made on the degree of similarity between the new mapping set being presented to the network and the previous one. These can be completely random or differ in a small number of mappings only. Borrowing from J.R. Wakeling [20], the slow mapping change mode corresponds to a single mapping change in the new mapping set. The random mapping change mode corresponds to random mapping sets being presented to the network.
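As an illustration of this measurement procedure and of the two mapping change modes, the sketch below accumulates a histogram of learning times without resetting the weights between mapping sets. It assumes the hypothetical ChialvoBakNet class sketched in Section 2.1; the function name and mode labels are illustrative.

```python
from collections import Counter
import numpy as np

def learning_time_histogram(net, n_sets=10_000, mode="slow"):
    """Histogram of learning times over a succession of mapping sets.
    mode="slow": change a single input-output mapping at a time;
    mode="random": present a completely new random mapping set each time.
    The network weights are deliberately not reset between mapping sets."""
    rng = np.random.default_rng()
    target = [int(o) for o in rng.integers(net.n_o, size=net.n_i)]
    counts = Counter()
    for _ in range(n_sets):
        if mode == "slow":
            i = int(rng.integers(net.n_i))
            target[i] = int(rng.integers(net.n_o))   # slow change: one mapping only
        else:
            target = [int(o) for o in rng.integers(net.n_o, size=net.n_i)]
        counts[net.learn(target)] += 1
    return counts

# counts = learning_time_histogram(ChialvoBakNet(8, 72, 9), mode="slow")
```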

The distribution p(τ) for a network (8, n_m, 9) is shown in Figure 2.8a. The tail of the distribution (i.e. long learning times) recedes for larger middle layer sizes, which is consistent with the plot in Figure 2.7.

An interesting aspect of the model is the power law tail of p(τ) [20], as illustrated in Figure 2.8b. The power law tail is a telltale sign of scale-free behaviour, for which no single value of the learning time is typical in those networks. Shorter learning times occur more frequently than longer ones, but the latter occur frequently enough not to be singled out as exceptional.

The power-law tail of p(τ) can be understood in terms of avalanches of activation in the middle layer nodes. Borrowing terminology from statistical physics, three operating regimes are then identified:

Sub-critical regime for n_m >> n_i · n_o

Critical regime for n_m ≈ n_i · n_o

Super-critical regime for n_i · n_o >> n_m

(a) The learning time distributions for several middle layer sizes of a network with eight input and nine output nodes. Data from 1e+6 runs.

(b) The learning time distributions for several networks with a critical middle layer size: (8, 72, 9), (16, 272, 17), (32, 1056, 33), (64, 4160, 65). Data from 1e+6 runs.

Figure 2.8 The learning time distributions reveal three distinct regimes: sub-critical, critical and super-critical. The critical regime exhibits power law behaviour with p(τ) ~ τ^(-1.3) according to [20].

In [20], J.R. Wakeling proposed that the power law tail of p(τ) corresponds to an order-disorder phase transition in the model and that the key difference in the learning dynamics of the three operating regimes is the interference probability:

In sub-critical networks, there are enough middle layer nodes for interference events to be quite rare and therefore learning is very quick.

In super-critical networks, there are hardly enough middle layer nodes to learn without inducing interference events and therefore learning is extremely slow.

The learning dynamics for critical network sizes is in-between the other two regimes, with just enough interference to occasionally cause large learning times while most of the time learning is quite fast.

However, it should be noted that the model has not been proved to be critical in the proper statistical physics sense, in order to merit such terminology. Assessing the criticality of the model in the two-layer topology is certainly challenging.

Furthermore, the approximately straight segments in the distributions of Figure 2.8b do not necessarily imply that p(τ) is a proper power law tail distribution, as very clearly explained in the paper [9] by Clauset, Shalizi and Newman. Straight segments in a log-log plot are a necessary but not sufficient condition for p(τ) to be a power law tail distribution. Due to timing constraints, however, no conclusive power law testing was completed for p(τ) and, in consequence, the terminology proposed in [20] is adopted throughout the document.
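For reference, the estimator recommended in [9] can be sketched as follows. The helper below is illustrative only and is not part of the thesis simulations; it applies the discrete approximation of the Clauset-Shalizi-Newman maximum-likelihood estimate to learning times above a chosen cut-off, and a proper analysis would additionally require the goodness-of-fit procedure described in that paper.

```python
import numpy as np

def powerlaw_exponent_mle(taus, tau_min):
    """Discrete approximation of the Clauset-Shalizi-Newman MLE for the
    exponent of a power-law tail p(tau) ~ tau^(-alpha), tau >= tau_min."""
    tail = np.asarray([t for t in taus if t >= tau_min], dtype=float)
    return 1.0 + len(tail) / np.sum(np.log(tail / (tau_min - 0.5)))
```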

    2.1.4 Summary

    The key elements of this section are the following:


Network dynamics: Defined by Winner-Take-All dynamics and learning by synaptic depression (negative feedback).

Input-output learning: The network is able to learn an arbitrary mapping set M where to each input node i corresponds an output node M(i).

Local flagging mechanism: Plasticity changes are locally marked for recent activity.

Global feedback mechanism: Feedback is provided in the form of a global feedback signal specifying whether the most recent changes are unsatisfactory.

Solution to the credit-assignment problem: Requires sparse network activity, so that no plasticity changes occur until a global feedback signal is received.

Two typical timescales: The network signalling occurs on the time scale of the firing patterns, while the tagging and feedback mechanisms occur on a much longer timescale that is relevant to the scale of events in the external world.

Interference events: The learning of input-output mappings can be disrupted by the unlearning of previously learnt mappings.

Metastable synaptic landscape: The active configuration is barely supported by the winning weights.

Neural avalanches: For middle layer sizes around n_m = n_i · n_o the network displays power-law behaviour in the learning time distribution p(τ).

    2.2 Storing Mappings

The plasticity rules introduced in Section 2.1 enable the network to learn a random mapping set M and quickly adapt to another mapping set whenever needed. Not much information is left [1] in the synaptic weights to reliably retrieve M at a later stage, since the active weights that supported M were depressed3 by a random amount in [0, 1] to support the new mapping set.

3More specifically, the active weights that are not shared by the old and the new mapping sets.

An additional mechanism is therefore required to store the information from previously learnt mapping sets for later recall. It turns out that such a mechanism exists and amounts to depressing less severely the weights that have been successful in the past; it is called the selective punishment rule [8][1].

The selective punishment rule requires small modifications to the plasticity rules to enable the distinction between successful and unsuccessful weights:

1. A random input node i is selected.

2. Input node i fires and activates a middle layer node m and output node o according to the WTA rule.

3. If output node o is correct, i.e. o = M(i), tag the weights w(m, i) and w(o, m) as successful and return to step 1.


4. Otherwise depress the active weights w(m, i) and w(o, m) by:

A random amount in [0, 1] if w(m, i) and w(o, m) have never been successful.

A random amount in [0, δ] otherwise.

Return to step 1.
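A possible implementation of these modified rules is sketched below, building on the hypothetical ChialvoBakNet class of Section 2.1. The per-weight tag arrays and the reading of step 4 (a weight is depressed in [0, δ] as soon as it has been tagged at least once) are one interpretation, not code from the original papers.

```python
import numpy as np

class SelectiveNet(ChialvoBakNet):
    """Selective punishment: tagged (previously successful) weights are
    depressed by a random amount in [0, delta] instead of [0, 1]."""

    def __init__(self, n_i, n_m, n_o, delta=0.001, rng=None):
        super().__init__(n_i, n_m, n_o, rng)
        self.delta = delta
        self.tag_mi = np.zeros((n_m, n_i), dtype=bool)
        self.tag_om = np.zeros((n_o, n_m), dtype=bool)

    def learn(self, target):
        depressions = 0
        while self.mapping() != list(target):
            i = int(self.rng.integers(self.n_i))        # step 1: random input node
            m, o = self.fire(i)                         # step 2: WTA activation
            if o == target[i]:                          # step 3: tag successful weights
                self.tag_mi[m, i] = True
                self.tag_om[o, m] = True
                continue
            # step 4: depress, less severely if the weight was ever successful
            d_mi = self.delta if self.tag_mi[m, i] else 1.0
            d_om = self.delta if self.tag_om[o, m] else 1.0
            self.w_mi[m, i] -= self.rng.uniform(0.0, d_mi)
            self.w_om[o, m] -= self.rng.uniform(0.0, d_om)
            depressions += 1
        return depressions
```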

In the Chialvo-Bak model, recalling a mapping set refers to a different operation than in other neural network models. Since synaptic plasticity is required to retrieve the information stored in the synaptic weights, the network is re-adapting to a previously seen mapping set rather than recalling it. Nevertheless, in order to distinguish from the learning rules without selective punishment, the term recall will be used.

(a) Example of learn and recall performance without selective punishment.

(b) Example of learn and recall performance with the selective punishment rule.

Figure 2.9 The number of depressions required to first learn and then recall four random mapping sets M1, ..., M4. The network is presented with the mapping sets in random succession, and the recall time is recorded for each graph, i.e. at recall = 10 the network has seen each mapping set 10 times.

The selective punishment rule results in a dramatic performance increase (under the random mapping change mode), as shown in Figure 2.9. This performance increase results from the network establishing preferred paths from each input node to the output nodes required by the mappings being presented. These preferred paths are the first to be queried when the active configuration is no longer correct. A detailed example of the selective punishment dynamics is presented in the Appendix of this section.

The weights tagged by the selective punishment rule are constrained to a region4 within a distance δ from unity, as shown in Figure 2.10. This region is referred to as the δ-band.

4The uniform distribution of weights in the δ-band in Figure 2.10 results from depressing the tagged weights by a fixed amount rather than a random amount in [0, δ]. In the latter case, the resulting distribution of weights in the δ-band would be similar to that of Figure 2.5b.

Figure 2.10 The weights tagged by the selective punishment rule are kept within the δ-band located in [1 − δ, 1]. This is where the memory of previously successful mappings is stored. Discrete distribution of 500 bins.

2.2.1 Summary

The key elements of this section are the following:

Selective punishment rule: Recalling previously learnt mapping sets is enabled by depressing weights that have been successful in the past by a smaller random amount when no longer leading to the desired mapping.

Selective punishment dynamics: The performance increase results from the network establishing preferred paths from input nodes to output nodes as required by the learnt mapping sets. On average these preferred paths are queried much more often.

Delta band: Contains the weights representing the memory of previously successful mappings and is located at [1 − δ, 1].

    2.2.2 Appendix

The example from Figure 2.9 will be used to illustrate the dynamics of selective punishment. Suppose that M1, ..., M4 are presented to the network in that order and require input node one to activate output nodes {1, 3, 6, 6} respectively.

After learning the mapping from input node one to output node one required by M1, the winning weights w(m1, 1) and w(1, m1), where m1 denotes the winning middle layer node, are tagged by the selective punishment rule.

When presented with M2, input node one should now activate output node three, and the weights w(m1, 1) and w(1, m1) are depressed accordingly. This results in a series of successive depressions that bring these weights slightly below the respective second highest weights.

This succession of depressions is necessary since the average weight spacing for this network is roughly 1/(n_m + 1) ≈ 0.03 for the weight set w(m, i) and 1/(n_o + 1) ≈ 0.14 for the weight set w(o, m), whereas w(m1, 1) and w(1, m1) are now depressed by small random amounts in [0, δ], with δ = 0.001. This accounts for the relatively higher adaptation times when first learning mappings M2, M3 and M4 in Figure 2.9b, compared to Figure 2.9a.
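These two estimates follow from a standard order-statistics result, assuming the weights attached to a node are independent and uniform on [0, 1]: the expected gap between consecutive order statistics of n such weights is 1/(n + 1), so for the (6, 36, 6) network of Figure 2.9

$$ \mathbb{E}\!\left[w_{(k+1)} - w_{(k)}\right] = \frac{1}{n+1} \quad\Longrightarrow\quad \frac{1}{n_m + 1} = \frac{1}{37} \approx 0.03, \qquad \frac{1}{n_o + 1} = \frac{1}{7} \approx 0.14 . $$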

This also illustrates the negative performance impact that synaptic potentiation would have in this network, since it would lead to large weight differences (in units of depression amounts) between the active weights and the other weights. A metastable synaptic landscape, such as the one illustrated in Figure 2.5a, is a requirement for the network to quickly converge to new mapping sets.

Eventually, either w(m1, 1) or w(1, m1) will be depressed below the other weights. Supposing that w(m1, 1) is first, then input node one switches from middle layer node m1 to another middle layer node m, which has a 1/n_o probability of activating output node three. If that is the case, input node one has learnt the correct mapping for M2 and the weights w(m, 1) and w(3, m) are also tagged as successful.

If middle layer node m does not lead to output node three, it is depressed accordingly. Input node one is then very likely to switch back to middle layer node m1, which will still activate output node one, unless weight w(1, m1) is already the second highest weight of node m1, in which case m1 then has a chance of activating a different output node.

If output node three is still not found after a few more depressions, the search sequence will now alternate between middle layer node m1 and other middle layer nodes, and the output node of middle layer node m1 will alternate between output node one and the other output nodes.

After learning the mapping sets M1, ..., M4 for the first time, each input node has one or more preferred paths formed by the pairs of tagged weights that lead to the required output nodes. The network will poll these weights much more frequently.

The example above can easily be generalised to other input nodes and to the case where the weight w(1, m1) reaches the second highest weight before the weight w(m1, 1) does.

A sample run where the mapping sets M1, ..., M4 required input node one to activate output nodes {1, 3, 6, 6} resulted in the preferred paths for input node one shown in Table 2.1:

Preferred paths from input node 1

To middle layer node    From middle layer node to output node
3                       6
12                      2, 6
30                      6
31                      1
34                      3

Table 2.1 The successful weights tagged by the selective punishment rule result in a set of preferred paths for input node 1.

The preferred path to output node two, which is not required for input node one, was added by input node two when learning mapping set M4.

    2.3 Advanced Learning

The type of learning problems that the model can tackle so far are better described as solving a routing problem: given a mapping set M, the input nodes have to find paths to the output nodes M(i).


A more advanced type of learning consists in considering mappings between input node activation patterns and the activation of specific output nodes. As before, the state of each input node can be active (1) or inactive (0) and the entire configuration of input nodes is represented by a binary vector. For example, {1, 0, 0, 1, 0, 1} is an activation pattern for a network with six input nodes.

Learning the basic Boolean functions {AND, OR, XOR, NAND, ...} is a particular example of this type of learning. The logical values of propositions A and B are represented by the states of two input nodes, and the logical value of the function f(A, B) is represented by the activation of one of the two output nodes.

The changes to the plasticity rules that are required to learn this type of problem are surprisingly small and amount to a slightly modified Winner-Take-All rule:

The input configuration activates the middle layer node m with maximum Σ_i w(m, i) x(i), where x(i) ∈ {0, 1} is the state of input node i.

Middle node m activates the output node o with maximum w(o, m), as before.

A bias input node that is always active is necessary in order to compute the states where the remaining input nodes are inactive.

An example of the network solving the XOR problem under the above plasticity rules is shown in Figure 2.11. An example of weights that implement a solution to the XOR problem is shown in Table 2.2 in the Appendix of this section.

    Figure 2.11 An example active configuration implementing the XOR truth table.

The network can learn the XOR problem with three middle layer nodes. In general, it can learn any mapping with one middle layer node per input pattern (2^n middle layer nodes for n input nodes), by discovering for each input pattern the corresponding middle layer node pointing to the correct output. As there are as many middle layer nodes as input configurations, learning convergence is guaranteed. As can be appreciated from the XOR example, the network can often learn with fewer middle layer nodes, but the exact minimum depends on the mapping.
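The toy sketch below illustrates the modified WTA rule on the XOR problem. It is a hypothetical, self-contained example (four middle layer nodes, i.e. one per input pattern, so that convergence is guaranteed as argued above); depressing all active weights of the winning path on an error follows the basic plasticity rules and is one possible interpretation.

```python
import numpy as np

def fire_pattern(x, w_mi, w_om):
    """Modified WTA rule: the middle node with the largest weighted sum over
    the active inputs wins and activates its strongest output node."""
    m = int(np.argmax(w_mi @ x))
    o = int(np.argmax(w_om[:, m]))
    return m, o

rng = np.random.default_rng()
n_i, n_m, n_o = 3, 4, 2                       # bias + A + B, four middle nodes, two outputs
w_mi = rng.uniform(0.0, 1.0, (n_m, n_i))
w_om = rng.uniform(0.0, 1.0, (n_o, n_m))

# XOR truth table: bias node always active, target output node index = A xor B
patterns = [np.array([1, a, b]) for a in (0, 1) for b in (0, 1)]
targets = [a ^ b for a in (0, 1) for b in (0, 1)]

depressions = 0
while depressions < 100_000 and any(
        fire_pattern(x, w_mi, w_om)[1] != t for x, t in zip(patterns, targets)):
    k = int(rng.integers(len(patterns)))      # present a random input pattern
    x, t = patterns[k], targets[k]
    m, o = fire_pattern(x, w_mi, w_om)
    if o != t:                                # depress the active weights on error
        w_mi[m, x.astype(bool)] -= rng.uniform(0.0, 1.0)
        w_om[o, m] -= rng.uniform(0.0, 1.0)
        depressions += 1

print("XOR learnt after", depressions, "depressions")
```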

    2.3.1 Summary

    The key elements of this section are the following:

Advanced learning capability: The model can learn mappings from input node configurations, representing the activation state of the input nodes, to the respective output nodes. In particular, the basic Boolean functions can be learned.


Advanced learning plasticity rule: The middle layer node with the maximum weighted sum of the weights w(m, i) over the active input nodes is selected, and it activates the output node as before.

    2.3.2 Appendix

An example of weights that implement a solution to the XOR problem is shown in Table 2.2.

Input to middle w(m,i)      Middle to output w(o,m)
1  1  0.5                   1  1  0.1
1  2  0.4                   1  2  0.2
1  3  0.1
2  1  0.3                   2  1  0.4
2  2  0.7                   2  2  0.3
2  3  0.9
3  1  0.2                   3  1  0.5
3  2  0.6                   3  2  0.6
3  3  0.8

Table 2.2 Example weights to solve the XOR problem under the advanced learning plasticity rules.


    Chapter 3

    Research Results

This chapter presents the results of the research that was conducted during this MSc project.

3.1 δ-band Saturation

The selective punishment rule enables the network to quickly re-adapt to previously learnt mappings by depressing less severely the weights that were successful at least once in the past. It is also known that the performance of this mechanism has an ageing effect at large time scales [1], as illustrated in Figure 3.1.

(a) At short time scales the selective punishment rule drastically improves the ability to quickly recall previously learnt mapping sets.

(b) At long time scales the recall performance degrades progressively.

Figure 3.1 The ageing effect on the selective punishment performance at large time scales.

As before, the term recall refers to the re-adaptation of the network to a previously learnt mapping set.

The performance degradation of selective punishment is caused by the saturation of the δ-band, which is the region within a distance δ from unity where the weights tagged as successful are constrained to lie.


For the selective punishment rule to be effective, each input node should be able to quickly sort through the tagged weights to recover the preferred path to the correct output node. Ideally, each input node would have established one preferred path for each previously learnt output node. As such, it would take a number of depressions of the order of the number of preferred paths to find the correct output node.

On the other hand, if the number of preferred paths leading to the same output node grows further, the advantage of the tagging mechanism for identifying the preferred middle layer nodes ceases to be effective.

The increase of paths leading to the same output node is a consequence of all weights eventually being given a chance to participate in correct mappings. Consider the continuous raising of the weights caused by the depressions of the plasticity rules: each raising step is the difference between the highest weight and the second highest. Eventually, all weights end up in the δ-band and are soon able to compete with the tagged weights for participation in a correct configuration. Once that occurs, one additional path to the output node is created.

The monotonic increase of tagged weights leads to a saturation of the δ-band, as increasingly more weights are confined to that region of weight space. This effect can be appreciated in Figure 3.2a, and the corresponding increase of recall times is illustrated in Figure 3.2b.

(a) The number of tagged weights increases monotonically and leads to a saturation of the δ-band.

(b) The performance of the selective punishment rule degrades with successive recalls.

Figure 3.2 As the percentage of weights tagged as correct increases, the recall times approach the performance of the regime without selective punishment.

    3.1.1 Desaturation strategies

The monotonic increase of tagged weights is a consequence of the permanent tagging of the selective punishment rule. As such, a mechanism is required to reduce the tagging lifespan.

P. Bak and D. Chialvo proposed [1] a mechanism of neuron ageing to tackle this issue, where nodes are replaced at a fixed rate, their weights randomised and the tagging information removed. However, the neuron replacement rate may need to depend on the level of network activity in order to successfully counter the saturation rate of the δ-band.

Several strategies that result in non-permanent tagging were reviewed, and the first of them was selected for implementation:

Global tag threshold: weights are untagged if not successful for more than a global threshold number of depressions.

Local tag threshold: as the previous one, but the threshold depends on the past performance of each weight.

Interference correction: weights are untagged after a threshold number of interference events.

    3.1.2 Global tag threshold

The global tag threshold has been investigated in greater detail, and numerical simulations suggest that an optimum threshold value exists for each network size.
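A minimal sketch of this desaturation rule, on top of the hypothetical SelectiveNet class of Section 2.2: each weight keeps a counter of unsuccessful depressions since its last success and loses its tag once the counter exceeds the global threshold. The per-weight counter bookkeeping is one possible reading of the rule, not the implementation used for the simulations below.

```python
import numpy as np

class ThresholdNet(SelectiveNet):
    """Selective punishment with a global tag threshold: a weight is untagged
    after being unsuccessful more than `threshold` times since its last success."""

    def __init__(self, n_i, n_m, n_o, delta=0.001, threshold=48, rng=None):
        super().__init__(n_i, n_m, n_o, delta, rng)
        self.threshold = threshold
        self.miss_mi = np.zeros((n_m, n_i), dtype=int)
        self.miss_om = np.zeros((n_o, n_m), dtype=int)

    def learn(self, target):
        depressions = 0
        while self.mapping() != list(target):
            i = int(self.rng.integers(self.n_i))
            m, o = self.fire(i)
            if o == target[i]:                       # success: tag and reset counters
                self.tag_mi[m, i], self.miss_mi[m, i] = True, 0
                self.tag_om[o, m], self.miss_om[o, m] = True, 0
                continue
            self.miss_mi[m, i] += 1                  # failure: update counters and
            self.miss_om[o, m] += 1                  # untag above the threshold
            if self.miss_mi[m, i] > self.threshold:
                self.tag_mi[m, i] = False
            if self.miss_om[o, m] > self.threshold:
                self.tag_om[o, m] = False
            d_mi = self.delta if self.tag_mi[m, i] else 1.0
            d_om = self.delta if self.tag_om[o, m] else 1.0
            self.w_mi[m, i] -= self.rng.uniform(0.0, d_mi)
            self.w_om[o, m] -= self.rng.uniform(0.0, d_om)
            depressions += 1
        return depressions
```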

In order to ensure that the optimal tag threshold value does not depend on the activity level of the network, an increasing number of mapping sets was presented to networks of the same size. The performance of the optimal tag threshold value was consistent across the number of presented mapping sets, as illustrated in Figure 3.3. For the network (6, 36, 6) the optimal tag threshold value is close to 48.

(a) The average saturation of the δ-band for several global tag threshold values (16, 32, 48, 64, 80) is consistent across the number of presented mapping sets.

(b) The average recall time performance for the same global tag threshold values is consistent across the number of presented mapping sets.

Figure 3.3 For the network (6, 36, 6) the optimal global tag threshold value is close to 48.

For networks with a larger number of input nodes the optimal tag threshold value is also higher. This was verified for the network (12, 144, 12), where the optimal tag threshold value is around 64. This is illustrated in Figure 3.4.

(a) The average saturation of the δ-band for several global tag threshold values (32, 48, 64, 80).

(b) The average recall time performance for the same global tag threshold values.

Figure 3.4 For the network (12, 144, 12) the optimal global tag threshold value is around 64.

The optimal tag threshold seems closely related to an optimal average number of tagged middle layer nodes and of tagged output nodes behind them, as illustrated in Figure 3.5.

Tag threshold values that are too low result in the network forgetting successful nodes too fast, as can be observed from the sharp decrease in the average recall time in Figure 3.5. Tag threshold values that are too high fail to get rid of path redundancy fast enough; this results in an increase of the average recall time, as illustrated in Figure 3.4b for the tag threshold value of 80, for example. Somewhere in between lies the optimal value.

The blue line in Figure 3.5b represents the average number of tagged middle layer nodes per input node. From Table 3.1, the number of tagged middle layer nodes surpasses the number of output nodes from tag threshold 32 and above.

The green line of the same graph represents the average number of tagged output nodes per tagged middle layer node. This value decreases until tag threshold 32 is reached, as the network compensates for the lack of enough tagged middle layer nodes by tagging more output nodes per tagged middle layer node. Near tag threshold 32 a minimum is reached, and the value starts growing again as higher tag threshold values cannot get rid of path redundancy fast enough.

The best tag threshold found for this network produced an average of 7.9 tagged middle layer nodes, which is nearly two nodes more than the number of output nodes. The full results are presented in Table 3.1.

    3.1.3 Summary

    The key elements of this section are the following:

    -band saturation: The permanent tagging of successful weights results in themonotonous increase of weights being constrained to the -band region, thereforereducing the performance advantage of the selective punishment rule.

    Desaturation strategies: Several desaturation strategies are possible and amount

    24

  • 7/29/2019 Self-Organised Learning in the Chialvo-Bak Model

    30/55

    Global Threshold Average Middle Layer Average Output Layer

    16 4.8350 1.544724 5.3833 1.512932 6.0450 1.411840 6.9167 1.413548 7.9133 1.5362

    56 9.3767 1.699864 10.9283 1.937272 12.4167 2.106680 14.4617 2.3799

    Table 3.1 Average number of tagged middle layer nodes per input node and averagenumber of tagged output nodes per tagged middle layer node for different values of theglobal tag threshold in the network (6, 36, 6).

[Figure 3.5a: average recall time vs. global tag threshold; panel title: (6,36,6) [runs:100x250 map:128 rand].]

(a) The average recall time decreases sharply until reaching the optimal global tag threshold value and slowly starts increasing again beyond that point.

[Figure 3.5b: average number of tagged nodes (middle, and output x10) vs. global tag threshold; panel title: (6,36,6) [runs:100x250 map:128 rand].]

(b) The average number of tagged middle layer nodes steadily increases with increasing global tag threshold values. The average number of tagged output nodes per tagged middle layer node has a minimum when the average number of tagged middle layer nodes is equal to the number of output nodes.

Figure 3.5 The relation between the optimal global tag threshold and the average number of tagged middle layer nodes and the average number of tagged output nodes per tagged middle layer node.


Global tag threshold: Sets a global limit on the number of times a tagged weight can be wrong before becoming untagged. There is an optimal tag threshold for each network size that is independent of the level of network activity.

Global tag threshold dynamics: The optimal value of the global threshold is related to the average number of tagged middle layer nodes and the average number of tagged output nodes behind them.


    3.2 Markov Chain Representation

A Chialvo-Bak network can be represented by a first-order Markov chain when considering the evolution of the network as a sequence of learnt-mappings states. Such a representation is useful for deriving several statistical properties of the model analytically, such as the average learning time $\langle\tau\rangle$, the learning time distribution $p(\tau)$ and the average number of interference events $\langle I\rangle$.

The Markov chain representation seems appropriate since the evolution of the network is to a large extent stochastic. The random amounts of depression from the plasticity rules result in changes to the active configuration, which is the basic macroscopic state of the network. Assuming that the transition from an active configuration to a new active configuration is stochastic and only depends on the current configuration, the evolution of the network can be described by a first-order Markov chain¹.

Rather than considering the evolution between active configurations, a more meaningful basic state of the Markov chain representation is the learning state of the network, i.e. the number of currently learnt mappings. For each active configuration, the learning state is determined by counting the number of learnt mappings it implements. As such, the correspondence between active configurations and learning states maintains the Markov chain properties mentioned above.
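For illustration, a minimal sketch of this counting step is given below; the encoding of an active configuration and of the target mapping as input-to-output dictionaries is hypothetical and not the data structure used in the simulations.

def learning_state(active_config, target_mapping):
    # Count how many input nodes currently produce their target output node;
    # this count is the learning state of the Markov chain for the given
    # active configuration.
    return sum(1 for inp, out in target_mapping.items()
               if active_config.get(inp) == out)

# Example: two of the three target mappings are currently learnt -> state 2.
target = {1: 1, 2: 2, 3: 3}
config = {1: 1, 2: 2, 3: 5}
assert learning_state(config, target) == 2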

The Markov chain has $m + 1$ states, noted $S_0, S_1, \ldots, S_m$, corresponding to learning from zero up to all $m$ mappings of the target set, where $m$ is the number of input nodes.

In this context, learning a new mapping set corresponds to the Markov chain starting from an initial state and evolving towards the final state $S_m$. The chain has a transition from $S_j$ to $S_{j+1}$ when an additional output node is learnt and from $S_j$ to $S_{j-1}$ in the case of an interference event. This is illustrated in Figure 3.6.

Figure 3.6 The Markov chain representation considers the evolution of the network in terms of the number of learnt mappings. In this example, the network started in state 2 and successfully reached the final state 3 after two depressions. The evolution sequence was 2 → 2 → 3. The target mapping set is M = {1, 2, 3}.

The final state $S_m$ is special since the Markov chain stops when arriving at that state.

¹A $k$-th order Markov chain would depend on the $k$ previous steps.


No transitions from $S_m$ to any other state are possible; this corresponds to the network having fully learnt the mapping set, and therefore no further depressions are necessary. The fact that all states can reach $S_m$ and that no transition is possible from $S_m$ to any other state qualifies the Markov chain as an absorbing Markov chain.

In general, there are four possible transitions from a given state $S_j$. Noting $S(k)$ the state of the system at evolution step $k$: if $S(k) = S_j$ then $S(k+1) \in \{S_{j-1}, S_j, S_{j+1}, S_{j+2}\}$. Figure 3.7 shows an example of a transition to $S_{j+2}$.

Figure 3.7 The network can learn up to two mappings after one depression, giving the transitions $S_j \to S_{j+1}$ and $S_j \to S_{j+2}$, and it can unlearn a single mapping, giving $S_j \to S_{j-1}$.

For networks with three input nodes, all possible transitions are shown in Figure 3.8, where the arrows indicate the sense of the transitions.

Figure 3.8 All the possible transitions for a network with three input nodes are shown above. The arrows indicate the sense of the transitions. No transition is possible from $S_3$ to any of the other states.

A Markov chain is completely determined by the state transition matrix $\mathbf{A}$, whose columns specify the transition probabilities between states, and the initial state probability vector $\mathbf{p}$, which specifies the initial state probabilities $\Pr(S(0) = S_j)$.

The element $a_{ij}$ of the state transition matrix $\mathbf{A}$ is the probability $\Pr(S_i \leftarrow S_j)$ of the transition $S_j \to S_i$. The element $p_j$ of the column vector $\mathbf{p}$ is the probability of starting the chain in state $S_j$.


For the elements of $\mathbf{A}$ and $\mathbf{p}$ to represent valid probabilities the following must hold:

$$\sum_{i=0}^{m} a_{ij} = 1 \quad \text{for any column index } j \text{ of } \mathbf{A} \qquad (3.1)$$

$$\sum_{i=0}^{m} p_i = 1 \qquad (3.2)$$

For a network with three input nodes, $\mathbf{A}$ and $\mathbf{p}$ have the following form:

$$\mathbf{A} = \begin{pmatrix}
a_{00} & a_{01} & 0 & 0\\
a_{10} & a_{11} & a_{12} & 0\\
a_{20} & a_{21} & a_{22} & 0\\
0 & a_{31} & a_{32} & 1
\end{pmatrix}
\quad\text{and}\quad
\mathbf{p} = (p_0\; p_1\; p_2\; p_3)^T,$$

where $a_{30} = a_{02} = 0$ since the corresponding transitions are not possible (see Figure 3.8).

For example, running this network in the slow change mode, where only one mapping is changed each time, corresponds to the initial state probability vector $\mathbf{p} = (0\; 0\; 1\; 0)^T$.

For a general network (·, ·, ·) with $m$ input nodes, $\mathbf{A}$ is an $(m+1) \times (m+1)$ matrix with the same banded structure, and $\mathbf{p} = (p_0\; p_1\; \cdots\; p_m)^T$. Column $j$ of $\mathbf{A}$ has non-zero entries only in rows $j-1$, $j$, $j+1$ and $j+2$ (unlearning one mapping, no change, learning one mapping and learning two mappings, respectively), and the last column has the single entry $a_{mm} = 1$ for the absorbing state. All other elements vanish,

$$a_{ij} = 0 \quad \text{for } i \le j - 2 \text{ or } i \ge j + 3,$$

since the corresponding transitions are impossible (for the appropriate ranges of $j$).

    3.2.1 Statistical properties

The state transition matrix $\mathbf{A}$ and the initial state probability vector $\mathbf{p}$ enable the statistics of the Markov chain to be computed in a very straightforward way. For detailed analytical derivations see [13] and [17], for example.


For example, the elements of the $k$-th power of $\mathbf{A}$, noted $\mathbf{A}^k$, yield the transition probabilities to a given state in $k$ steps, i.e. $a^{(k)}_{ij}$ is the transition probability from state $S_j$ to state $S_i$ in $k$ steps.

To motivate the above statement, consider the previous example of the network with three inputs. The probability $\Pr^{(2)}(2 \leftarrow 1)$ of going from state 1 to state 2 in two steps is computed by summing over the accessible intermediate states:

$$\Pr^{(2)}(2 \leftarrow 1) = \Pr(2 \leftarrow 0)\Pr(0 \leftarrow 1) + \Pr(2 \leftarrow 1)\Pr(1 \leftarrow 1) + \Pr(2 \leftarrow 2)\Pr(2 \leftarrow 1) + \Pr(2 \leftarrow 3)\Pr(3 \leftarrow 1) \qquad (3.3)$$

which in terms of the transition matrix elements is written:

$$\Pr^{(2)}(2 \leftarrow 1) = a_{20}a_{01} + a_{21}a_{11} + a_{22}a_{21} + a_{23}a_{31}. \qquad (3.4)$$

The last expression is the product of the row of $\mathbf{A}$ corresponding to state 2 with the column corresponding to state 1, i.e. the element $(2, 1)$ of the product $\mathbf{A}\,\mathbf{A} = \mathbf{A}^2$.

An important observation is that the above computation relied explicitly on the defining property of the Markov chain: step $(k+1)$ is completely determined by step $(k)$ and the state transition probability matrix $\mathbf{A}$. This allowed $\Pr^{(2)}(2 \leftarrow 1)$ to be factored in terms of the accessible intermediate states in Eq. (3.3), and the resulting probabilities to be associated with the elements of $\mathbf{A}$ in Eq. (3.4). This will be useful when interpreting the numerical evidence for the Markov chain representation.

The powers of the matrix $\mathbf{A}$ lead to a straightforward computation of the learning time distribution $p(\tau)$. Since the last row of $\mathbf{A}^k$ gives the transition probabilities to the absorbing state $S_m$ in $k$ steps when starting from each of the $m+1$ states, the product with $\mathbf{p}$ gives the probability of having reached the absorbing state within $k$ steps when starting from the initial state distribution $\mathbf{p}$:

$$P(\tau \le k) = (0\;\cdots\;0\;1)\,\mathbf{A}^k\,\mathbf{p}, \quad k \ge 0 \qquad (3.5)$$

from which the distribution follows as $p(\tau = k) = P(\tau \le k) - P(\tau \le k-1)$.

The transient states matrix $\mathbf{T}$ contains the transition probabilities between non-absorbing states, i.e. all states excluding the absorbing state $S_m$.

$$\mathbf{A} = \begin{pmatrix} \mathbf{T} & \mathbf{0}\\ \mathbf{F} & 1 \end{pmatrix}$$

where $\mathbf{F}$ is the row sub-matrix of transition probabilities to the final state $S_m$.

The powers of the matrix $\mathbf{T}$ are obtained from $\mathbf{A}^k$, and their elements $t^{(k)}_{ij}$ correspond to the transition probability in $k$ steps from the non-absorbing state $S_j$ to the non-absorbing state $S_i$.

$$\mathbf{A}^k = \begin{pmatrix} \mathbf{T}^k & \mathbf{0}\\ \mathbf{f}_k & 1 \end{pmatrix}$$

where $\mathbf{f}_k$ represents the last row of $\mathbf{A}^k$ except the final element 1.

The learning time distribution $p(\tau)$ can be expressed in terms of the sub-matrix $\mathbf{T}$, since the $j$-th element of the last row of $\mathbf{A}^k$ is equal to one minus the sum of the $j$-th column of $\mathbf{T}^k$, i.e. $(\mathbf{f}_k)_j = 1 - \sum_i t^{(k)}_{ij}$.

Rewriting the first part of Eq. (3.5) in terms of $\mathbf{T}$ yields:

$$P(\tau \le k) = 1 - \mathbf{1}^T\,\mathbf{T}^k\,\tilde{\mathbf{p}}, \quad k \ge 0 \qquad (3.6)$$

where $\tilde{\mathbf{p}} = (p_0\; p_1\; \cdots\; p_{m-1})^T$ contains the components of $\mathbf{p}$ except the probability of starting in the final state, and $\mathbf{1}$ is the ones column vector with one entry per transient state.

An absorbing Markov chain has a number of interesting properties [13]:

The matrix $\mathbf{I} - \mathbf{T}$ has an inverse $\mathbf{N} = (\mathbf{I} - \mathbf{T})^{-1}$, which is called the Fundamental Matrix:

$$\mathbf{N} = \mathbf{I} + \mathbf{T} + \mathbf{T}^2 + \cdots$$

The element $n_{ij}$ of matrix $\mathbf{N}$ is the expected number of times the chain is in state $S_i$, when starting in state $S_j$, before reaching the absorbing state $S_m$.

The last property yields the expectation of the learning time $\langle\tau\rangle$, since adding the entries of column $j$ of $\mathbf{N}$ gives the expected number of steps spent in all transient states before reaching the absorbing state $S_m$, when starting in state $S_j$.

Therefore,

$$\langle\tau\rangle = \mathbf{1}^T\,\mathbf{N}\,\tilde{\mathbf{p}} \qquad (3.7)$$

Higher order moments of the random variable $\tau$ can be obtained from the expression for the factorial moments [17]:

$$\mathbb{E}[\tau(\tau-1)\cdots(\tau-r+1)] = r!\;\mathbf{1}^T\,\mathbf{T}^{r-1}\,\mathbf{N}^r\,\tilde{\mathbf{p}}$$

where $\mathbb{E}[\cdot]$ is the statistical expectation.
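For instance, the variance of the learning time follows from the first two factorial moments:

$$\operatorname{Var}(\tau) = \mathbb{E}[\tau(\tau - 1)] + \langle\tau\rangle - \langle\tau\rangle^2 = 2\,\mathbf{1}^T\,\mathbf{T}\,\mathbf{N}^2\,\tilde{\mathbf{p}} + \langle\tau\rangle - \langle\tau\rangle^2.$$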

The expectation of the interference events $\langle I\rangle$ is easily derived from $\mathbf{N}$, by noting that the interference probability in a given state $S_j$ is given by the element $a_{(j-1)j}$ of $\mathbf{A}$.

Therefore:

$$\langle I\rangle = \mathbf{v}^T\,\mathbf{N}\,\tilde{\mathbf{p}} \qquad (3.8)$$

where $\mathbf{v} = (0\; a_{01}\; a_{12}\; \cdots\; a_{(m-2)(m-1)})^T$ is the column vector with the elements of $\mathbf{A}$ corresponding to interference events.
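Continuing the toy example above, the sketch below evaluates Eqs. (3.7) and (3.8); the numerical values are placeholders and the vector v is built as just described.

import numpy as np

A = np.array([[0.2, 0.1, 0.0],
              [0.5, 0.5, 0.0],
              [0.3, 0.4, 1.0]])
p = np.array([0.6, 0.4, 0.0])

T = A[:2, :2]
p_t = p[:2]

N = np.linalg.inv(np.eye(2) - T)     # fundamental matrix
mean_tau = np.ones(2) @ N @ p_t      # Eq. (3.7): average learning time

# Eq. (3.8): v_j is the interference probability from transient state j,
# i.e. the element a_{(j-1) j}; for this toy chain v = (0, a_01).
v = np.array([0.0, A[0, 1]])
mean_I = v @ N @ p_t                 # average number of interference events

print(mean_tau, mean_I)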

    3.2.2 Markov chain representation: numerical evidence

To test the validity of the Markov chain representation, the following experiment was performed:

A test set of networks was selected, with input layer size ranging from one to twelve input nodes.

For each input layer size:


The output layer size was the input layer size plus one.

Three middle layer sizes, corresponding to the three regimes (sub-critical, critical, super-critical), were used: $n_{\text{sub}} = 2\,n_{\text{critical}}$ and $n_{\text{super}} = n_{\text{critical}}/2$.

Each network was simulated a large number of times and the following statistics were recorded:

state transition counts

initial state counts

the learning time distribution $p(\tau)$

the average learning time $\langle\tau\rangle$

the average number of interference events $\langle I\rangle$

The state transition and initial state counts can be extracted since the network evolves in discrete steps, and for each discrete step the learning state can be computed from the active configuration.

The measured state transition and initial state counts are normalised according to Eqs. (3.1) and (3.2), yielding² the maximum likelihood estimators [6] $\mathbf{A}_{\text{MLE}}$ and $\mathbf{p}_{\text{MLE}}$, respectively.

For each measured $\mathbf{A}_{\text{MLE}}$ and $\mathbf{p}_{\text{MLE}}$, the quantities $p(\tau)_{\text{pred}}$, $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ are computed from Eqs. (3.6), (3.7) and (3.8), respectively.

The measured values are compared to the computed values using the Normalised Root Mean Square Error (NRMSE) and the Kolmogorov-Smirnov statistic (a minimal sketch of the estimation and comparison step is given after this list):

Normalised root mean square error: $\text{NRMSE}[x_{\text{pred}}] = \dfrac{\sqrt{\big\langle (x_{\text{pred}} - x)^2 \big\rangle}}{x_{\max} - x_{\min}}$

Kolmogorov-Smirnov statistic: $D = \max_{k} \big| P(\tau \le k)_{\text{pred}} - P(\tau \le k) \big|$
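A minimal sketch of the estimation and comparison step (the function names and the guard against unvisited states are illustrative; the actual counts come from the simulations):

import numpy as np

def estimate_A_p(transition_counts, initial_counts):
    # Column-normalise the raw transition counts (Eq. 3.1) and normalise the
    # initial state counts (Eq. 3.2) to obtain the maximum likelihood estimates.
    col_sums = np.maximum(transition_counts.sum(axis=0, keepdims=True), 1)
    A_mle = transition_counts / col_sums
    p_mle = initial_counts / initial_counts.sum()
    return A_mle, p_mle

def ks_statistic(cdf_pred, cdf_measured):
    # Largest absolute difference between the predicted and measured
    # cumulative learning time distributions.
    return np.max(np.abs(np.asarray(cdf_pred) - np.asarray(cdf_measured)))

def nrmse(x_pred, x):
    # Normalised root mean square error between predicted and measured values.
    x_pred, x = np.asarray(x_pred, float), np.asarray(x, float)
    return np.sqrt(np.mean((x_pred - x) ** 2)) / (x.max() - x.min())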

In order to obtain enough state transition counts for each state to enable a good estimation of $\mathbf{A}_{\text{MLE}}$, the mappings were presented to the network with a uniform initial state probability (except for the absorbing state). Under this mapping change mode, the learning time distribution $p(\tau)$ has a different shape than under the slow mapping change mode, for example. Figure 3.9 shows examples of measured learning time distributions $p(\tau)$ for two sample networks and the predicted distributions $p(\tau)_{\text{pred}}$ obtained from Eq. (3.6).

The NRMSE for $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ was zero across the tested networks. The actual value was below the numerical accuracy of the experiment, as estimated by the inverse of the number of samples in each simulation. Such an NRMSE value is perhaps an indication that the underlying method is not suitable for validating the Markov chain representation.

The Kolmogorov-Smirnov statistic for $p(\tau)_{\text{pred}}$ computed from Eq. (3.6) is shown in Figure 3.10.

²The derivation of the maximum likelihood estimator for $\mathbf{A}$ follows the usual scheme of maximising the data log-likelihood with Lagrange multipliers for the normalisation constraint of each column of $\mathbf{A}$.


[Figure 3.9a: learning time distribution p(τ), log-log axes; panel title: (4,10,5) [runs:4e+5 uniform]; curves: model, markov.]

(a) Super-critical network (4, 10, 5).

[Figure 3.9b: learning time distribution p(τ), log-log axes; panel title: (12,312,13) [runs:1e+6 uniform]; curves: model, markov.]

(b) Sub-critical network (12, 312, 13).

Figure 3.9 The learning time distribution $p(\tau)$ for a uniform initial state probability (except for the absorbing state). The black line represents the predicted distribution $p(\tau)_{\text{pred}}$ obtained from Eq. (3.6). The green line is the measured $p(\tau)$.

[Figure 3.10: Kolmogorov-Smirnov statistic D vs. input layer size; curves: super, critical, sub.]

Figure 3.10 The Kolmogorov-Smirnov statistic for $p(\tau)_{\text{MLE}}$ for the test network set, with networks ranging from two input nodes to 12 input nodes. The statistic is computed from Eq. (3.6) and decreases with increasing input layer size for the networks considered.

For the networks considered, the Kolmogorov-Smirnov statistic was usually higher in the super-critical and critical regimes. With increasing input layer size, the statistic decreases and converges for all three regimes. The abrupt reduction for the network with two input nodes led to additional measurements with larger output layers, but the results were very similar.

The differences in the statistic values indicate that the Markov chain representation is more accurate for larger networks. Smaller networks may be better represented by other types of Markov chain, or may not allow a Markov chain representation at all by not respecting the Markov chain properties, for example.

A direct estimation of the Markov chain order might clarify the above question. Due to time constraints this was not pursued. Several estimators exist for the Markov chain order, such as the BIC Markov order estimator [11] or the Peres-Shield order estimator [19].


3.2.3 Analytical solution for (2, ·, ·)

An analytical solution for $\mathbf{A}$ has been obtained for the network with two input nodes (2, ·, ·), for the case where the input nodes cannot share output nodes³.

For (2, ·, ·) the state transition matrix $\mathbf{A}$ and the initial state probability vector $\mathbf{p}$ have the form:

$$\mathbf{A} = \begin{pmatrix}
a_{00} & a_{01} & 0\\
a_{10} & a_{11} & 0\\
a_{20} & a_{21} & 1
\end{pmatrix}
\quad\text{and}\quad
\mathbf{p} = (p_0\; p_1\; p_2)^T. \qquad (3.9)$$

Each state $S_j$ can be further separated into sub-states corresponding to the graphs $G^d_j$, whose upper index $d$ gives the number of degrees of freedom in the middle layer. The basic graphs for (2, ·, ·) are illustrated in Figure 3.11. For simplicity, these sub-states are simply referred to as the graphs of state $S_j$.

Figure 3.11 The basic graphs for the two-input networks (2, ·, ·). The upper index specifies the number of degrees of freedom in the middle layer, i.e. $G^1_1$ is the graph with one learnt input node and one shared middle layer node.

The main steps in this computation are the following:

1. Compute the number of active configurations for each state in terms of the graphs $G^d_j$.

2. Equal graph accessibility approximation: assume that in a given state each of its graphs is equally accessible. This is strictly the case for $S(0)$, where the distribution of graphs depends only on the relative proportion of active configurations implementing each graph. However, it is not necessarily the case for transitions from another state, or from the same state, where the graph-to-graph transitions may favour some particular graphs.

3. For each graph, compute the probability of transition to $S_{j-1}$, $S_j$, $S_{j+1}$ and $S_{j+2}$, corresponding to unlearning one mapping, no change, and learning one and two mappings, respectively.

³The case where the input nodes can share output nodes may easily be derived by simplification of the present results.


When starting the network from random weights, the active configuration distribution for each state does not correspond to the relative frequency count of its graphs, as shown in Table 3.2.

Graphs                              G^1_0   G^1_1   G^1_2   G^2_0   G^2_1   G^2_2
Configuration count                 6       6       0       54      36      6
Configuration count distribution    0.0556  0.0556  0       0.5000  0.3333  0.0556
Random graph distribution           0.1666  0.1666  0       0.3749  0.2502  0.0417

Table 3.2 The configuration count distribution and the initial graph distribution when starting from random weights do not match. The values above are for the network (2, 3, 4). The configuration count distribution is obtained by counting the relative frequency of the configurations implementing a given graph. For example, (2, 3, 4) has six possible configurations for graph $G^1_0$ out of 108 distinct configurations, which gives a relative frequency of 0.0556 for graph $G^1_0$.

The correct distribution is obtained by first counting the configurations generated from the non-shared middle layer nodes and from one of the shared middle layer nodes, and then multiplying by the appropriate factor for each additional shared middle layer node. Such a strategy is justified by considering all the combinations of triplets and enforcing the WTA rule between the second and third values of the triplet. The WTA rule assigns the total number of combinations to the corresponding graphs $G^d_j$, and the resulting graph distribution is the correct one, as shown in Table 3.3. This approach has also been verified successfully for the network (3, ·, ·).

Graphs                          G^1_0   G^1_1   G^1_2   G^2_0   G^2_1   G^2_2
Random graph distribution       0.1666  0.1666  0       0.3749  0.2502  0.0417
Predicted graph distribution    0.1667  0.1667  0       0.3750  0.2500  0.0417

Table 3.3 The initial graph distribution when starting from random weights matches the predicted graph distribution, for the network (2, 3, 4). The predicted graph distribution is detailed in the text.

For the network (2, $n_m$, $n_o$), where $n_m$ and $n_o$ denote the middle and output layer sizes, one obtains the following number of configurations per graph:

$$\Omega^1_0 = n_m (n_o - 2) \qquad \Omega^1_1 = 2\, n_m \qquad \Omega^2_0 = n_m (n_m - 1)(n_o - 1)^2 \qquad \Omega^2_1 = 2\, n_m (n_m - 1)(n_o - 1) \qquad \Omega^2_2 = n_m (n_m - 1)$$

where $\Omega^d_j$ is the order of the graph $G^d_j$, i.e. the number of distinct configurations that implement the graph.


This allows the initial state probability vector $\mathbf{p}_{\text{rand}}$, for a network started from a random configuration, to be computed:

$$\mathbf{p}_{\text{rand}} = \frac{1}{\Omega} \left( \Omega^1_0 + \Omega^2_0 \quad \Omega^1_1 + \Omega^2_1 \quad \Omega^2_2 \right)^T$$

where $\Omega = \Omega^1_0 + \Omega^2_0 + \Omega^1_1 + \Omega^2_1 + \Omega^2_2$.

One final element before computing the transition probabilities is to determine the probability $p_0$ of re-selecting the same node after depression, and the probability $1 - p_0$ of selecting a different node. The re-selection probability is given by $p_0 = 1/(2n - 1)$, where $n$ is the size of the layer concerned; this value can only be related to the particular weight distribution of this model, shown in Figure 2.5b. Numerical evidence for this value of $p_0$ is shown in Figure 3.12.

[Figure 3.12: average re-selection probability Pr(reselection) vs. middle layer size; panel title: (2,*,10) [depressions:1e+4]; curves: reselect, 1/(2n−1), 1/n.]

Figure 3.12 The average re-selection probability in the middle layer nodes, for the networks (2, ·, 10). The measured probability (in blue) follows the curve $1/(2n-1)$ (in green). The curve $1/n$ is shown for comparison (in red). Data collected over $10^4$ depressions in each network.

The element $a_{01}$ of the transition matrix $\mathbf{A}$ for (2, ·, ·) can be obtained as follows:

$$a_{01} = \Pr(S_0 \leftarrow S_1) = \Pr(G^1_0 \leftarrow G^1_1)\,\Pr(G^1_1 \mid S_1) + \Pr(G^2_0 \leftarrow G^1_1)\,\Pr(G^1_1 \mid S_1)$$

Using the equal graph accessibility approximation,

$$a_{01} \approx \big( \Pr(G^1_0 \leftarrow G^1_1) + \Pr(G^2_0 \leftarrow G^1_1) \big)\, \frac{\Omega^1_1}{\Omega^1_1 + \Omega^2_1}$$

which with the graph transition probabilities gives:

$$a_{01} \approx \left( p_0 (1 - q_0)\,\frac{n_o - 2}{n_o - 1} + (1 - p_0)(1 - q_0)\,\frac{n_o - 1}{n_o} \right) \frac{\Omega^1_1}{\Omega^1_1 + \Omega^2_1}$$

where $p_0$ and $q_0$ represent the re-selection probabilities of middle and output nodes, respectively.

The main elements of the last computation are detailed below, where it is assumed that the network is learning to connect input two to output two while learning the identity mapping set M = {1, 2} (the other possible mapping set requiring an inversion of the roles).


$\dfrac{\Omega^1_1}{\Omega^1_1 + \Omega^2_1}$ is the equal graph accessibility approximation for graphs $G^1_1$ and $G^2_1$ in state $S_1$.

$p_0 (1 - q_0)\,\dfrac{n_o - 2}{n_o - 1}$ is the transition probability from $G^1_1$ to $G^1_0$. This transition occurs whenever the same middle layer node is re-selected and the output node is not re-selected. There are $n_o - 2$ out of $n_o - 1$ such cases that enable the transition.

$(1 - p_0)(1 - q_0)\,\dfrac{n_o - 1}{n_o}$ is the transition probability from $G^1_1$ to $G^2_0$, occurring whenever the same middle layer node is not re-selected and the output node is not re-selected; there are $n_o - 1$ out of $n_o$ cases that enable the transition.

Figure 3.13 The graph transitions from $G^1_1$ to $G^1_0$ and $G^2_0$. The target mapping set is M = {1, 2}.
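As a worked illustration, the sketch below evaluates the approximate $a_{01}$ as a function of the layer sizes. The output layer re-selection probability $q_0$ is assumed here to take the same $1/(2n-1)$ form as the middle layer one; this is an assumption of the sketch, not a result established above.

def a01_approx(n_m, n_o):
    # Approximate interference element a_01 for the two-input network, using
    # the equal graph accessibility approximation and the graph orders above.
    p0 = 1.0 / (2 * n_m - 1)                      # middle layer re-selection
    q0 = 1.0 / (2 * n_o - 1)                      # output layer re-selection (assumed)

    omega_11 = 2 * n_m                            # shared middle node, one learnt
    omega_21 = 2 * n_m * (n_m - 1) * (n_o - 1)    # distinct middle nodes, one learnt
    graph_weight = omega_11 / (omega_11 + omega_21)

    to_10 = p0 * (1 - q0) * (n_o - 2) / (n_o - 1)        # G^1_1 -> G^1_0
    to_20 = (1 - p0) * (1 - q0) * (n_o - 1) / n_o        # G^1_1 -> G^2_0
    return (to_10 + to_20) * graph_weight

print(a01_approx(10, 10))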

    3.2.4 Alternate formulation: graph transitions

There is an alternative to relying on the equal graph accessibility approximation. By considering the transitions between graphs directly, rather than between states, the need to compute the graph occupation within states is avoided altogether.

The resulting transition matrix has more elements, one column for each possible graph, but the predictions are potentially more accurate.

The elements of this new state transition matrix $\mathbf{A}_g$ and the new initial probability vector $\mathbf{p}_g$ are as follows, with the graphs ordered as $(G^1_0, G^2_0, G^1_1, G^2_1, G^2_2)$ and the index pair $dj$ abbreviating graph $G^d_j$:

$$\mathbf{A}_g = \begin{pmatrix}
a_{10,10} & a_{10,20} & a_{10,11} & 0 & 0\\
a_{20,10} & a_{20,20} & a_{20,11} & 0 & 0\\
a_{11,10} & a_{11,20} & a_{11,11} & a_{11,21} & 0\\
a_{21,10} & a_{21,20} & a_{21,11} & a_{21,21} & 0\\
a_{22,10} & 0 & a_{22,11} & a_{22,21} & 1
\end{pmatrix}
\quad\text{and}\quad
\mathbf{p}_g = (p_{10}\; p_{20}\; p_{11}\; p_{21}\; p_{22})^T,$$

where $a_{22,20} = a_{10,21} = a_{20,21} = 0$ since no transitions are possible between those graphs.

The element $a_{ij}$ of $\mathbf{A}_g$ represents $\Pr(G_i \leftarrow G_j)$, the transition probability from graph $G_j$ to graph $G_i$.


Figure 3.14 The graph transitions for a network with two input nodes. The arrows indicate the sense of the transitions. No transition is possible from graph $G^2_2$ to any of the other graphs.

The analytical results previously obtained are directly applicable to the computation of the elements of the transition matrix $\mathbf{A}_g$. For example, the element $a_{01}$ of $\mathbf{A}$ yields the elements $a_{10,11}$ and $a_{20,11}$ of $\mathbf{A}_g$, as follows:

$$a_{10,11} = \Pr(G^1_0 \leftarrow G^1_1) = p_0 (1 - q_0)\,\frac{n_o - 2}{n_o - 1}$$

$$a_{20,11} = \Pr(G^2_0 \leftarrow G^1_1) = (1 - p_0)(1 - q_0)\,\frac{n_o - 1}{n_o}$$

These two elements of $\mathbf{A}_g$ correspond to the two graph transitions in Figure 3.13.

Computing $p(\tau)$ and $\langle\tau\rangle$ proceeds according to Eqs. (3.6) and (3.7), respectively. However, Eq. (3.8) is no longer applicable for computing $\langle I\rangle$, as the elements of $\mathbf{A}_g$ corresponding to interference events are no longer on the upper diagonal of the matrix.

Nevertheless, a similar computation is still possible:

$$\langle I \rangle = \sum_{j\,:\,\text{column of } \mathbf{N}} \;\sum_{i\,:\,\text{interference graph of } j} a_{ij}\, \big(\mathbf{N}\,\tilde{\mathbf{p}}_g\big)_j \qquad (3.10)$$

where $a_{ij}$ is an element of $\mathbf{A}_g$ corresponding to an interference transition from graph $G_j$ to graph $G_i$, $\tilde{\mathbf{p}}_g$ is the initial probability vector excluding the absorbing graph, and $(\mathbf{N}\,\tilde{\mathbf{p}}_g)_j$ is the expected number of visits to graph $G_j$.
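A sketch of Eq. (3.10) in matrix form is given below; the boolean mask marking which elements of $\mathbf{A}_g$ are interference transitions is supplied by the caller, and all names are illustrative.

import numpy as np

def expected_interference(A_g, p_g, interference_mask):
    # Expected number of interference events for the graph-transition chain:
    # sum the interference elements of each column and weight them by the
    # expected number of visits to the corresponding transient graph.
    T = A_g[:-1, :-1]                              # transient part (absorbing graph last)
    N = np.linalg.inv(np.eye(T.shape[0]) - T)      # fundamental matrix
    visits = N @ p_g[:-1]                          # expected visits per transient graph
    interference_prob = (A_g * interference_mask)[:, :-1].sum(axis=0)
    return interference_prob @ visits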

    3.2.5 Analytical solution: numerical evidence

To test the validity of the analytical solution, a methodology similar to that of Sub-section 3.2.2 for testing the Markov chain representation was used. In the present case, the network test set was limited to networks with two input nodes, 10 output nodes and a range of middle layer sizes.

The analytical expressions for $\mathbf{A}$ and $\mathbf{A}_g$ were used to obtain $p(\tau)_{\text{pred}}$, $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ from Eqs. (3.6), (3.7) and (3.8) or (3.10), respectively, and the predicted values were then compared to the ones obtained from the simulations.


The slow mapping change mode was used this time, instead of the uniform initial state probability, as the number of states is quite small and the slow change mode is able to visit them a sufficient number of times. The resulting learning time distributions $p(\tau)$ are more representative of the typical network dynamics.

[Figure 3.15a: learning time distribution p(τ), log-log axes; panel title: Gamma(2,10,10) [runs:1e+6 slow]; curves: model, pred A, pred Ag.]

(a) Super-critical network (2, 10, 10).

[Figure 3.15b: learning time distribution p(τ), log-log axes; panel title: Gamma(2,100,10) [runs:1e+6 slow]; curves: model, pred A, pred Ag.]

(b) Sub-critical network (2, 100, 10).

Figure 3.15 The learning time distribution $p(\tau)$ in the slow change mode. The green and red lines are the predictions from Eq. (3.6) using $\mathbf{A}$ and $\mathbf{A}_g$, respectively.

The predicted learning time distributions $p(\tau)$ are quite similar to the measured ones. This can be appreciated in Figure 3.15, where the predictions are compared in the super-critical and the sub-critical regimes. As with the predictions from the Markov chain validation testing, the super-critical regime distributions are less accurate than the sub-critical ones.

[Figure 3.16: Kolmogorov-Smirnov statistic D vs. middle layer size; panel title: Gamma(2,*,10) [runs:1e+6 slow]; curves: pred A, pred Ag.]

Figure 3.16 The Kolmogorov-Smirnov statistic for the learning time distribution $p(\tau)$ in the slow change mode. The green and red lines are the statistic values obtained from Eq. (3.6) using $\mathbf{A}$ and $\mathbf{A}_g$, respectively.

The above is confirmed by the Kolmogorov-Smirnov statistic, which is higher for the super-critical and critical regimes, as shown in Figure 3.16.

The statistic decreases with increasing middle layer size and is comparable to the maximum likelihood estimations for the transition matrix obtained in Sub-section 3.2.2 for the critical and sub-critical regimes (compare Figure 3.16 with the $D$ value for network size two in Figure 3.10).

Interestingly, the predictions obtained from $\mathbf{A}_g$ do not perform better than the ones from $\mathbf{A}$, except in the sub-critical regime, where the statistic is lower for the $p(\tau)$ obtained from $\mathbf{A}_g$. This may be particular to the two-input networks, which are somewhat singled out from the others in Figure 3.10.

In terms of $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$, the predictions obtained from $\mathbf{A}$ are also more accurate than the ones obtained from $\mathbf{A}_g$, as shown in Table 3.4 and Figure 3.17.

[Figure 3.17, left panel: average learning time vs. middle layer size; right panel: average interference events vs. middle layer size; networks Gamma(2,*,10) [runs:1e+6 slow]; curves in both panels: model, pred A, pred Ag.]

Figure 3.17 Comparing $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ to the measured values, in the slow change mode. The green and red lines are the predictions from Eqs. (3.7) and (3.10) for $\mathbf{A}$ and $\mathbf{A}_g$, respectively.

The improvement in the Kolmogorov-Smirnov statistic for $\mathbf{A}_g$ in the sub-critical regime is not reflected in the NRMSE, where the predictions from $\mathbf{A}$ are consistently more accurate.

NRMSE of the predicted average learning time
        super-critical    critical      sub-critical
A       0.0839 (0.0660)   0.0577 (0)    0.0149 (0.0203)
A_g     0.1042 (0.0758)   0.0817 (0)    0.0264 (0.0332)

NRMSE of the predicted average interference events
        super-critical    critical      sub-critical
A       0.1323 (0.1066)   0.1339 (0)    0.0452 (0.0616)
A_g     0.1704 (0.1269)   0.2000 (0)    0.0875 (0.1089)

Table 3.4 The normalised root mean square error (NRMSE) for the predicted $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ in the super-critical, critical and sub-critical regimes. The values in brackets are the standard deviation of the error.

    3.2.6 Summary

    The key elements of this section are the following:


Markov chain representation: The Chialvo-Bak network in the two-layer topology is fairly accurately represented by a first-order Markov chain, where the chain states correspond to the number of learnt mappings.

Markov chain statistics: The state transition matrix $\mathbf{A}$, obtained either analytically or by maximum likelihood estimation, allows $p(\tau)_{\text{pred}}$, $\langle\tau\rangle_{\text{pred}}$ and $\langle I\rangle_{\text{pred}}$ to be computed easily.

Numerical evidence for the Markov chain: The Markov chain representation is accurate in all regimes (super-critical, critical and sub-critical) for all but very small networks.

Analytical solution for $\mathbf{A}$ for (2, ·, ·): An approximate analytical solution was obtained for the network with two input nodes.

Analytical solution for $\mathbf{A}_g$ for (2, ·, ·): An analytical solution was obtained for the network with two input nodes when considering the transitions between graphs rather than between states. As such, this solution is in principle exact.

Numerical evidence for the analytical solutions: The performance of the analytical solutions is comparable to a maximum likelihood estimation of the transition matrix in the critical and sub-critical regimes. The predictions from the approximate solution performed better than those from the non-approximate solution.

    3.2.7 Appendix