
A User's Guide to Stochastic Encoder/Decoders ∗

Dr S P Luttrell

The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and the application of this to the fusion of data derived from multiple sensors. The purpose of this report is to give a practical introduction to self-organising stochastic encoder/decoders, in which each input vector is encoded as a stochastic sequence of code indices, and then decoded as a superposition of the corresponding sequence of code vectors. Mathematica software for implementing this type of encoder/decoder is presented, and numerical simulations are run to illustrate a variety of emergent properties.

I. EXECUTIVE SUMMARY

Research aim: The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and the application of this to the fusion of data derived from multiple sensors.

Results: To reach the above overall goal it is necessary to identify and study an appropriate self-organising network structure. To this end, this report is a self-contained tutorial which demonstrates the use of self-organising stochastic encoder/decoder networks. A complete software suite, written in Mathematica, is presented, and many worked examples of how to run the software are given.

Conclusions: The main conclusion is that stochastic encoder/decoder networks are simple to implement in Mathematica, and they automatically (i.e. by a process of self-organisation) discover a wide range of useful ways of encoding data. These properties can then be put to use to address the problem of fusing data derived from multiple sensors.

Customer benefits: The main benefit is that this self-organising approach to designing encoder networks, and ultimately data fusion networks, will lead to large savings when it is applied to real-world problems. This benefit arises principally from the hands-off nature of the self-organising approach, in which the task of identifying objects and correlations in data is delegated to a computer, rather than being done manually by one or more expert humans.

Recommendations: Extend the approach advocated in this report to the case of a multi-layer encoder/decoder network, to allow discovery by the network of more complicated objects and correlations in data. This will move the research towards more realistic data fusion scenarios.

Key words: Encoder, Decoder, Stochastic Vector Quantiser, SVQ, Data Fusion, Self-Organisation

∗ This paper appeared as DERA Technical Report, DERA/S&P/SPI/TR990290, 18 October 1999. © Crown Copyright 1999 Defence Evaluation and Research Agency UK

II. INTRODUCTION

A. Background to This Report

The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and to apply this to the fusion of data derived from multiple sensors. This self-organising approach to designing encoder networks, and ultimately data fusion networks, will lead to large savings when it is applied to real-world problems. This benefit arises principally from the hands-off nature of the self-organising approach, in which the task of identifying objects and correlations in data is delegated to a computer, rather than being done manually by one or more expert humans. This report focuses on a particular type of self-organising network which encodes/decodes data with minimum distortion. A useful side effect of optimising this type of network is that it must discover objects and correlations in data, as is required of a network that is to be applied to data fusion problems.

To visualise a minimum distortion encoder/decoder, consider a communication system that consists of a transmitter, a limited bandwidth communication channel, and a receiver. In order to send a signal from the transmitter to the receiver, it is necessary to encode it so that it can be accommodated within the limited bandwidth of the communication channel, and then to decode it at the receiver. Such encoding/decoding leads to the received signal being a distorted version of the original, and by carefully optimising this encoding/decoding scheme the associated distortion can be minimised.

The simplest type of encoder/decoder is the vector quantiser (VQ) [1], which encodes each input vector as one of a finite number of possible integers, which is then transmitted along the communication channel, and then decoded as one of a finite number of possible reconstruction vectors. The set of reconstruction vectors (the code book) contains all of the information that is needed to specify the encoder/decoder. Thus the encoder applies a nearest neighbour algorithm to the code book to determine which of the reconstruction vectors is closest (in the Euclidean sense) to the input vector, and then transmits this integer (the code index) along the communication channel to the decoder, which then uses it to look up the corresponding reconstruction vector in its own identical copy of the code book. The code book must be optimised so that the average Euclidean reconstruction error is minimised.

This type of encoder/decoder may be generalised to allow for corruption of the code index whilst in transit over the communication channel. The optimisation must now make information in the transmitted code index robust with respect to channel distortion [2]. In the simplest case where the code index is transmitted as an analogue signal (a vector of voltages, say), and the communication channel distortion is an additive noise process, then the optimisation process is very similar to the training algorithm used to optimise a topographic mapping [3]. For this reason, this type of encoder/decoder is known as a topographic vector quantiser.

This report discusses a further generalisation of this type of encoder/decoder, in which the encoder uses a stochastic (rather than deterministic) algorithm to pick the code index for transmission along the communication channel - this is called a stochastic vector quantiser (SVQ) (see the discussion in Section A 2). An obvious disadvantage of using a stochastic encoding algorithm is that it must discard more information about its input than would an optimal deterministic encoding algorithm. However, the main advantage of stochastic encoding is that when multiple code indices are sampled they do not all have to be the same (unlike in the case of deterministic encoding), so different samples can record different types of information about the input. This effect could also be achieved by using multiple deterministic encodings of the input, but this would be a manually designed approach that is anathema to the self-organised approach that is required here.
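To make the idea concrete, the following is a conceptual sketch only (the helper names SampleIndex and StochasticReconstruction are hypothetical; the report's actual encoder/decoder routines are developed in Appendix A): given a vector p of posterior probabilities over the M code indices and the matrix r of reconstruction vectors, a stochastic encoder draws n code indices from p, and the decoder superposes (here taken to be an average of) the corresponding reconstruction vectors.

SampleIndex[p_] := Module[{u = Random[], cum},
  cum = FoldList[Plus, First[p], Rest[p]];     (* cumulative probabilities *)
  Length[Select[cum, # < u &]] + 1];           (* index of the sampled code *)

StochasticReconstruction[p_, r_, n_] :=
  Apply[Plus, Table[r[[SampleIndex[p]]], {n}]]/n;   (* average of n sampled code vectors *)

Because different samples can pick different code indices, the n-sample reconstruction is a convex combination of the rows of r, which is the property exploited throughout the simulations below.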

B. Layout of This Report

In Appendix A all of the Mathematica [4] routines that are required for simulating encoder/decoders are developed, and in Appendix B some complicated expressions are derived using Mathematica. The body of the report in Section III demonstrates how these routines may be used to obtain optimal encoder/decoders (all of which are SVQs) for various simple types of training data.

III. ENCODER/DECODER EXAMPLES

A. Preliminaries

If further background information is required, then Appendix A should be read first of all.

B. Notation

d   input dimensionality
x   input vector (dimensionality d)
M   total number of code indices
n   sampled number of code indices
w   matrix of weight vectors (dimensionality M × d)
b   vector of biasses (dimensionality M)
r   matrix of reconstruction vectors (dimensionality M × d)
A   partitioning matrix (unused) - see Section A 6 for details
L   leakage matrix (dimensionality M × M) - see Section A 6 for details
ε   update step size parameter - see Section A 10 for details
λ   weight decay parameter - see Section A 11 for details

C. Methodology

For each simulation a number of parameters have to be initialised. For instance, in the case of circular training data (see Section III D), the initialisation takes the form

d = 2; M = 4; n = 10; ε = 0.05;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

The first row initialises all of the scalar parameters d, M, n and ε.

The second row initialises the elements of the M × d matrix of weight vectors w to small random values uniformly distributed in the interval [−0.05, +0.05].

The third row initialises the elements of the M-dimensional vector of biasses b to small random values uniformly distributed in the interval [−0.05, +0.05].

The fourth row initialises the elements of the M × d matrix of reconstruction vectors r to small random values uniformly distributed in the interval [−0.05, +0.05].

The fifth row initialises the partitioning matrix A to a default state which removes its effect (it is not used in this report), and initialises the leakage matrix L to a default state which removes its effect (in Section III L a different type of L is used).

The simulation may then be run with these parameter values by invoking the following update routine (see Section A 10) as many times as required (with x = {Cos[#], Sin[#]} &[2 π Random[]] in the case of circular training data).

{D12, w, b, r} = UpdateSVQ[x, w, b, r, n, ε];

For convenience it is useful to record the training history using the following code fragment (whose initialisation is {whistory, bhistory, rhistory, D12history} = {{}, {}, {}, {}}).

{whistory, bhistory, rhistory, D12history} = MapThread[Append,
  {{whistory, bhistory, rhistory, D12history}, {w, b, r, D12}}]

The training history contains all the information that is required to create graphical displays that show what the SVQ has been doing.
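Putting these pieces together, a minimal driver loop is sketched below. It assumes the initialisation given above, the UpdateSVQ routine of Section A 10, and the circular training data of Section III D (400 training vectors, matching that simulation).

{whistory, bhistory, rhistory, D12history} = {{}, {}, {}, {}};
Do[
  x = {Cos[#], Sin[#]} &[2 π Random[]];           (* one random point on the unit circle *)
  {D12, w, b, r} = UpdateSVQ[x, w, b, r, n, ε];   (* one training update *)
  {whistory, bhistory, rhistory, D12history} =
    MapThread[Append, {{whistory, bhistory, rhistory, D12history}, {w, b, r, D12}}],
  {400}];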

D. Circle

This first simulation is designed to show how the SVQ behaves with data that live on a curved manifold, which is typical of high-dimensional sensor data, such as images. A circle is the simplest type of curved manifold, so it can be used to explore the basic SVQ behaviour. Each input vector then has the form x = (cos θ, sin θ). Thus a circular manifold has one intrinsic coordinate (the θ parameter), but it is embedded in a 2-dimensional space (the (cos θ, sin θ) vector). An analytic solution to this particular example was given in [5], where it was shown how the circle is sliced up into overlapping arcs by the SVQ. The following simulation has all of the same behaviour as the analytic solution, even though it is suboptimal because of the limited functional form of the sigmoid functions used in the encoder.

Initialise the parameter values.

d = 2; M = 4; n = 10; ε = 0.05;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that the input space is 2-dimensional (d = 2), the code book has 4 entries (M = 4), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.05 (ε = 0.05), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Train on 400 vectors derived from {Cos[#], Sin[#]} &[2 π Random[]], which generates points on the unit circle.

The training history of the loss function D1 + D2 is shown in Figure 1, where every 10th sample is shown. This shows the expected downward trend towards convergence.

The training history of the rows of the reconstruction matrix is shown in Figure 2, where every 10th sample is shown and the final state is highlighted with shaded circles. This shows the expected outward drift from the initial random state near the origin, to eventually jitter about just outside the unit circle. For each input vector on the unit circle, the reconstruction vector is a linear combination of the rows of the reconstruction matrix, where the coefficients of the linear combination are non-negative and sum to unity; this explains why the rows of the reconstruction matrix lie outside the unit circle.

Figure 1: The training history of the loss function for n = 10. Every 10th sample is shown.

Figure 2: The training history of the rows of the reconstruction matrix for n = 10. Every 10th sample is shown and the final state is highlighted.

The posterior probabilities that each code index is selected for input vectors lying in the square [−1, +1]² (i.e. not only the unit circle) are shown in Figure 3. The only points in [−1, +1]² for which these posterior probabilities are actually used lie on the unit circle.

The contour plots of the same posterior probabilities are shown in Figure 4. As in Figure 3, the only points in [−1, +1]² for which these posterior probabilities are actually used lie on the unit circle.

If the same simulation is repeated, but using n = 250, then the results are shown in Figure 5, Figure 6, Figure 7, and Figure 8.

The loss function is smaller in Figure 5 than in Figure 1. This is because more code indices are used in Figure 5, which preserves more information about the input vector, thus allowing a more accurate reconstruction to be made.

The rows of the reconstruction matrix r are larger in Figure 6 than in Figure 2. Also, the posterior probabilities that each code index is selected overlap more with each other in Figure 8 than in Figure 4. As before, this effect is explained by the need for the reconstruction to lie near the unit circle when formed from a (constrained) linear combination of the rows of the reconstruction matrix.

Figure 3: The posterior probabilities that each code index is selected for n = 10.

Figure 4: Contour plots of the posterior probabilities that each code index is selected for n = 10.

If the same simulation is repeated, but using n = 2, then the results are shown in Figure 9, Figure 10, Figure 11, and Figure 12.

When comparing Figure 9 to Figure 12 with Figure 1 to Figure 4, all the trends are the opposite of those observed when comparing Figure 5 to Figure 8 with Figure 1 to Figure 4, as expected.

Ideally, in the n = 1 case a pure vector quantiser would be obtained, in which the circle is partitioned into non-overlapping arcs (each covering π/2 radians in this case), and the rows of the reconstruction matrix would lie just inside the unit circle at the centroids of each of these arcs.

It should be noted that, for an encoder based on sigmoid functions, the ideal vector quantiser result cannot be obtained when the input data lives on an arbitrarily chosen manifold, because sigmoid functions have a highly restricted functional dependence on the input vector.

Figure 5: The training history of the loss function for n = 250. Every 10th sample is shown.

Figure 6: The training history of the rows of the reconstruction matrix for n = 250. Every 10th sample is shown and the final state is highlighted.

E. 2-Torus

The circular manifold used in Section III D has only one intrinsic coordinate, so it cannot be used to investigate any new SVQ behaviour that emerges when the data live on a higher dimensional curved manifold. However, if a pair of circles is used, so that each data vector is given by x = (cos θ1, sin θ1, cos θ2, sin θ2), then the manifold has two intrinsic coordinates, which may be used to investigate SVQ behaviour for data that live on a 2-dimensional curved manifold. This manifold has two intrinsic coordinates (the (θ1, θ2) vector), but it is embedded in a 4-dimensional space (the (cos θ1, sin θ1, cos θ2, sin θ2) vector). A manifold that is formed in this way from two circular manifolds is a 2-torus, which has the familiar doughnut shape when it is projected down into only three dimensions.


Figure 7: The posterior probabilities that each code index is selected for n = 250.

Figure 8: Contour plots of the posterior probabilities that each code index is selected for n = 250.

In [5] the case of a 2-torus was solved analytically to reveal that the behaviour of an SVQ depended on the size M of the code book and the number n of sampled code indices. The following simulation has all of the same behaviour as the analytic solution, even though it is suboptimal because of the limited functional form of the sigmoid functions used in the encoder.

Initialise the parameter values.

d = 4; M = 8; n = 50; ε = 0.05; λ = 0.005;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

Figure 9: The training history of the loss function for n = 2. Every 10th sample is shown.

Figure 10: The training history of the rows of the reconstruction matrix for n = 2. Every 10th sample is shown and the final state is highlighted.

These parameter values state that the input space is 4-dimensional (d = 4), the code book has 8 entries (M = 8), 50 code indices are sampled for each input vector (n = 50), the update step size is 0.05 (ε = 0.05), the weight decay parameter is 0.005 (λ = 0.005), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Weight decay is used to impose a prior bias towards solutions that have few non-zero entries in the weight matrix, because it is known that the optimal solution in this case is a factorial encoder [5].

Train on 1000 vectors derived from Flatten[Table[{Cos[#], Sin[#]} &[2 π Random[]], {2}]], which generates points on a 2-torus formed from the Cartesian product of a pair of unit circles.

Figure 11: The posterior probabilities that each code index is selected for n = 2.

Figure 12: Contour plots of the posterior probabilities that each code index is selected for n = 2.

The training history of the loss function is shown in Figure 13. This shows the expected downward trend towards convergence.

The training history of the rows of the reconstruction matrix is shown in Figure 14, where the left hand plot is one circular subspace of the 2-torus, and the right hand plot is the other circular subspace. In both of these subspaces 4 of the rows of the reconstruction matrix behave in a similar way to the case of training data derived from a unit circle (e.g. see Figure 6), whereas the other 4 rows of the reconstruction matrix remain much closer to the origin. Also, the 4 large rows of the reconstruction matrix in the left hand plot pair up with the 4 small rows of the reconstruction matrix in the right hand plot.

Weight decay has a symmetry breaking side effect, in which the large components of the rows of the reconstruction matrix drift around until they become axis aligned, as is clearly seen in Figure 14.

Figure 13: The training history of the loss function for n = 50. Every 10th sample is shown.

Figure 14: The training history of the rows of the reconstruction matrix for n = 50. The two circular subspaces of the 2-torus are shown separately.

The posterior probabilities that each code index is selected for input vectors lying on the 2-torus are shown in Figure 15. In each plot the horizontal and vertical axes wrap circularly to form a 2-torus.

Figure 15: Discretised versions of the posterior probabilities that each code index is selected for input vectors lying on the 2-torus and for n = 50.

The results shown in Figure 15 show that the code book operates as a factorial encoder, in which half of the code indices encode one of the circular subspaces, and the other half encode the other subspace. Each pair of vertical and horizontal stripes then intersects to define a small patch of the 2-torus, and input vectors lying in that small patch will be encoded mostly as samples of the corresponding pair of code indices, with a little overspill into other code indices in general. A factorial encoder operates by a process that is akin to triangulation, by slicing up the input space into intersecting subspaces. An SVQ can only do this provided that the number of code indices that are sampled is large enough to virtually guarantee that each of the subspaces has at least one code index associated with it.
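As a rough illustration of this sampling requirement (an order-of-magnitude estimate supplied here for orientation, not a result from the report): if each sampled code index is assumed to fall independently in either half of the code book with probability 1/2, then the probability that n samples miss one of the two halves entirely is approximately 2 × (1/2)^n = 2^(1−n). This is negligible for n = 50 (about 10^(−15)) but equals 1/2 for n = 2, which is consistent with the factorial encoding observed at n = 50 and its breakdown at small n below.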

If the same simulation is repeated, but using n = 2 and ε = 0.1, then the results are shown in Figure 16, Figure 17, and Figure 18.

Figure 16: The training history of the loss function for n = 2. Every 10th sample is shown.

Figure 17: The training history of the rows of the reconstruction matrix for n = 2. The two circular subspaces of the 2-torus are shown separately. Note the increase in scale by a factor of 2 compared with Figure 14.

Figure 18: Discretised versions of the posterior probabilities that each code index is selected for input vectors lying on the 2-torus and for n = 2.

The results shown in Figure 18 show that the code book operates as a joint encoder, in which each code index jointly encodes both of the circular subspaces of the 2-torus (each code index encodes a small patch of the 2-torus, as is clearly seen in Figure 18). This is because there are too few code indices (n = 2) being sampled to allow a factorial encoder a good chance of encoding both subspaces, and thus of having a small loss function.

Note that to obtain the results shown in Figure 18 weight decay has been left switched on (although it is actually unnecessary in this case) in order to make sure that the change from factorial to joint encoder is genuinely caused by reducing the value of n. Also note that the value of ε is increased (relative to that used in the factorial encoder simulation) to offset the tendency for the training algorithm to become trapped in a frustrated configuration when n is small.

This transition between joint and factorial encoding of a 2-torus has been correctly predicted by an exact analytic optimisation of the loss function [5]. However, the fact that it was possible to do an analytic calculation at all depended critically on the high degree of symmetry of the 2-torus. Such exact analytic calculations are not possible in the general case.

F. Imaging Sensor

The simulations in Section III D (circular manifold) and Section III E (toroidal manifold) showed how an SVQ behaves when the input data lie on an idealised curved manifold. In this Section these simulations will be extended to more realistic data which lie on curved manifolds with either a circular or a toroidal topology. The manifolds studied in this Section have the same topology as the earlier idealised manifolds, but they do not have the same geometry. However, if only their local properties are considered, then the idealised manifolds are very good approximations to the more realistic manifolds.

In these simulations a target will be imaged by one or more 1-dimensional sensors. The output of each sensor will be a vector of pixel values which depends on the target's position relative to the sensor. For a target that lives on a 1-dimensional manifold, the vector of images (derived from all of the sensors) lies on a manifold with one intrinsic coordinate. This generalises straightforwardly to the case of a target that lives on a multi-dimensional manifold, and yet further to the case of multiple targets. The general rule is that the number of continuous parameters needed to describe the state of the system under observation is equal to the dimensionality of the manifold on which the sensor data live, even though the actual dimensionality of the sensor data is usually much higher (e.g. imaging sensors).

In Section III G the case of 1 target living in 1 dimension imaged by 1 sensor is simulated (e.g. a range profile in which 1 target is embedded), which is approximated by the circular case simulated in Section III D. In Section III H the case of 1 target living in 2 dimensions independently imaged by 2 sensors is simulated (e.g. a range profile and an azimuth profile in which 1 target is embedded), which is approximated by the toroidal case simulated in Section III E. In Section III I the case of multiple independent (but identical) targets living in 1 dimension imaged by 1 sensor is simulated (e.g. a range profile in which multiple independent targets are embedded).


Note that the case of 1 target living in 2 dimensions independently imaged by 2 sensors (see Section III H) is not identical to the case of 2 independent (but identical) targets living in 1 dimension imaged by 1 sensor. Differences between these two cases arise only when the images of the 2 targets on the 1 sensor overlap each other.

When the target is centred on the sensor it will be modelled as (d = number of sensor pixels, s = standard deviation of the Gaussian profile function that is used to represent the target)

target = Table[Exp[-(i - d/2)^2/(2 s^2)], {i, d}];

G. Imaging Sensor (circular topology)

Initialise the parameter values.

d = 20; M = 4; n = 10; ε = 0.1; s = 2;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that the sensor has 20 pixels (d = 20), the code book has 4 entries (M = 4), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.1 (ε = 0.1), the target half-width is 2 (s = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Train on 400 vectors derived from RotateRight[target, Random[Integer, {0, d - 1}]], which centres the target on a randomly selected sensor pixel. Circular wraparound is used.

The training history of the loss function is shown in Figure 19, which should be compared with the roughly comparable case of training data derived from a unit circle in Figure 1. In this case the loss function is much noisier, because although the input manifold here is topologically equivalent to a circle (embedded in a 20-dimensional space), it is not geometrically a circle (embedded in a 2-dimensional space), which makes the encoding/decoding problem harder than before.

The training history of the rows of the reconstruction matrix is shown in Figure 20. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of target position are shown in Figure 21. Each code index responds smoothly to a localised range of target locations.

Figure 19: The training history of the loss function. Every 10th sample is shown.

Figure 20: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page.

H. Independent Imaging Sensors (2-toroidal topology)

Initialise the parameter values.

d = 40; M = 8; n = 50; ε = 0.1; λ = 0.005; σ = 2;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that each sensor has 20 pixels (d = 40 = 2 × 20), the code book has 8 entries (M = 8), 50 code indices are sampled for each input vector (n = 50), the update step size is 0.1 (ε = 0.1), the weight decay parameter is 0.005 (λ = 0.005), the target half-width is 2 (σ = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

Flatten[Table[RotateRight[target, Random[Integer, {0, d/2 - 1}]], {2}]]


Figure 21: The posterior probabilities that each code index is selected for all possible locations of the target.

This can be broken down into individual steps thus:

1. target centres the target on a sensor.

2. RotateRight[..., Random[Integer, {0, d/2 - 1}]] centres the target at a randomised position on a sensor using circular wraparound.

3. Table[..., {2}] generates two independently randomised instances of such targets.

4. Flatten[...] concatenates these into a single input vector.

Train on 400 vectors derived as above.

The training history of the loss function is shown in Figure 22.

Figure 22: The training history of the loss function for n = 50. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 23. Each image displays the entire training history of a single row reading down the page.

Figure 23: The training history of the rows of the reconstruction matrix for n = 50. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of target position are shown in Figure 24, which should be compared with Figure 15.

Figure 24: The posterior probabilities that each code index is selected for all possible locations of the target on each sensor for n = 50.

The results shown in Figure 24 show that the code book operates as a factorial encoder for the same reasons as were discussed in the context of Figure 15.

If the same simulation is repeated, but using n = 2 and increasing the number of training vectors to 1000, then the results are shown in Figure 25, Figure 26, and Figure 27.

Figure 25: The training history of the loss function for n = 2. Every 10th sample is shown.

Figure 26: The training history of the rows of the reconstruction matrix for n = 2. Each image displays the entire training history of a single row reading down the page.

The results shown in Figure 27 are not perfect, because the training process has got itself stuck in a trapped configuration. However, they show that the codebook mainly operates as a joint encoder for the same reasons that were discussed in the context of Figure 18.


Figure 27: The posterior probabilities that each code index is selected for all possible locations of the target on each sensor for n = 2.

I. Imaging Sensor with Multiple Independent Targets

Initialise the parameter values.

d = 100; M = 15; n = 10; ε = 0.1; λ = 0.005; σ = 2;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that the sensor has 100 pixels (d = 100), the code book has 15 entries (M = 15), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.1 (ε = 0.1), the weight decay parameter is 0.005 (λ = 0.005), the target half-width is 2 (σ = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

Apply[Plus, Table[RotateRight[target, Random[Integer, {0, d - 1}]], {10}]]

This can be broken down into individual steps thus:

1. target centres the target on a sensor.

2. RotateRight[..., Random[Integer, {0, d - 1}]] centres the target at a randomised position on a sensor using circular wraparound.

3. Table[..., {10}] generates 10 independently randomised instances of such targets.

4. Apply[Plus, ...] sums these to give a single input vector.

A typical example of such an input vector is shown in Figure 28.

Figure 28: Superposition of 10 Gaussian targets.

Train on 1000 vectors derived as above.
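The input vectors used here can be previewed before training. The following is a sketch (it assumes the Gaussian target profile of Section III F, with the half-width written as s rather than σ) that generates and plots one training vector, i.e. a superposition of 10 randomly placed targets as in Figure 28:

d = 100; s = 2;
target = Table[Exp[-(i - d/2)^2/(2 s^2)], {i, d}];
x = Apply[Plus, Table[RotateRight[target, Random[Integer, {0, d - 1}]], {10}]];
ListPlot[x]    (* cf. Figure 28 *)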

The training history of the loss function is shown in Figure 29.

Figure 29: The training history of the loss function. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 30. Each image displays the entire training history of a single row reading down the page. After some initial confusion each code index begins to respond to a very localised region of the input space.

Figure 30: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of the position of a single test target are shown in Figure 31. Each code index encodes a small patch of the input space, and there is a little overlap between adjacent patches.

Figure 31: The posterior probabilities that each code index is selected for all possible locations of a single test target.

The results shown in Figure 31 show that the code book operates very clearly as a factorial encoder, because despite the training data consisting of a large number of superimposed targets (see Figure 28), the code indices essentially code for single targets. In effect, the minimisation of the loss function has discovered the fundamental constituents out of which the training data have been built.

This behaviour is reminiscent of independent component analysis (ICA) [6], where an input that is an unknown mixture of a number of independent unknown components is analysed to discover the mixing matrix and the components. In the above simulation, the input is an unknown mixture of components, where the mixing matrix (whose entries are 0's and 1's in this case) selects which components are present, and each component is a target located at a particular position. Optimising the SVQ (on a training set of data) discovers the form of the components (as displayed in Figure 30), and subsequently using the SVQ (on a test set of data) to compute posterior probabilities effectively derives an estimate of the mixing matrix (e.g. for a single test target this estimate can be deduced from Figure 31) for each input vector.

J. Noisy Bars

In this simulation the problem is to encode an image which consists of a number of horizontal or vertical (but not both at the same time) noisy bars, in which each image pixel independently has a large amount of multiplicative noise and a small amount of additive noise (this type of training set was proposed in [7]). This is a more complicated version of the multiple target simulation in Section III I, where each horizontal bar is one type of target and each vertical bar is another type of target, and only one of these two types of targets is allowed to be present in each image.

Initialise the parameter values.

K = 6; d = K^2; M = 2 K; n = 20; ε = 0.1; ρ = 0.3; σ = 0.2;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that the image is 6 by 6 pixels (K = 6), the total number of pixels is 36 (d = K² = 36), the code book has 12 entries (M = 2K = 12), 20 code indices are sampled for each input vector (n = 20), the update step size is 0.1 (ε = 0.1), the probability of a bar being present is 0.3 (ρ = 0.3), the background noise level is 0.2 (σ = 0.2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training image using

If[Random[] < 0.5, #, Transpose[#]] &[
  Map[# + σ Random[] &, #, {2}] &[
    Table[# Table[Random[], {K}], {K}] &[
      Table[If[Random[] < ρ, 1, 0], {K}]]]];

This can be broken down into individual steps thus:

1. Table[If[Random[] < ρ, 1, 0], {K}] is a bit vector that decides at random whether each row of the image has a bar present with probability ρ.

2. Table[# Table[Random[], {K}], {K}] &[...] generates a whole image such that each column of the image is the product of the bit vector and an independent uniformly distributed random number in the interval [0, 1].

3. Map[# + σ Random[] &, #, {2}] &[...] adds an independent uniformly distributed random number in the range [0, σ] to each pixel value.

4. If[Random[] < 0.5, #, Transpose[#]] &[...] transposes the whole image with probability 1/2, which ensures that the generated image is equally likely to consist of horizontal bars or vertical bars.

Some typical examples of such input images are shown in Figure 32.

Figure 32: Some typical examples of noisy bar images.

Train on 4000 images derived as above.

The training history of the loss function is shown in Figure 33. This has a very noisy behaviour because the training data are very noisy, yet the codebook is relatively small.

The training history of the rows of the reconstruction matrix is shown in Figure 34.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 35, where it is clear that each code index encodes exactly one horizontal or vertical bar.

Figure 33: The training history of the loss function. Every 100th sample is shown.

Figure 34: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

The results shown in Figure 35 show that the code book operates very clearly as a factorial encoder, because despite the training data consisting of images of a variable number of horizontal (or vertical) bars, the code indices essentially code for single horizontal (or vertical) bars. In effect, the minimisation of the loss function has discovered the fundamental constituents out of which the training data have been built.

If the training image generator is altered so that each pixel within a bar has the same amount of multiplicative noise (i.e. the multiplicative noise is correlated), then training tends to get trapped in frustrated configurations. This correlated multiplicative noise training image generator is basically the same as the one used in [7], and can be implemented by making the following replacements to the above uncorrelated multiplicative noise training image generator (the resulting generator is sketched below):

1. Table[If[Random[] < ρ, 1, 0], {K}] → Table[If[Random[] < ρ, Random[], 0], {K}]

2. Table[# Table[Random[], {K}], {K}] &[...] → Table[#, {K}] &[...]
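Putting these two replacements together, the correlated multiplicative noise generator becomes the following sketch (assuming K, ρ and σ as initialised above):

If[Random[] < 0.5, #, Transpose[#]] &[
  Map[# + σ Random[] &, #, {2}] &[
    Table[#, {K}] &[
      Table[If[Random[] < ρ, Random[], 0], {K}]]]];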

Figure 35: The rows of the reconstruction matrix displayed in image format.

K. Stereo Disparity

In each simulation that has been presented thus far the data live on a manifold whose intrinsic coordinates are statistically independent of each other. The purpose of this simulation is to demonstrate what happens when the intrinsic coordinates are correlated.

In this simulation the problem is to encode a pair of 1-dimensional images of a target which derive from the two sensors of a stereoscopic imaging system. The location of the target on a sensor is specified by a single intrinsic coordinate, and the pair of such coordinates (one for each of the two sensors) are correlated with each other, because the target appears in similar positions on each sensor in the stereoscopic imaging system. Also, each image pixel independently has a large amount of multiplicative noise and a small amount of additive noise.

Initialise the parameter values.

K = 18; d = 2 K; M = 24; n = 3; ε = 0.1; s = 2.0; a = 4.0; σ = 0.2;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]}; L = IdentityMatrix[M];

These parameter values state that each 1-dimensional image has 18 pixels (K = 18), the total number of pixels is 36 (d = 2K = 36), the code book has 24 entries (M = 24), 3 code indices are sampled for each input vector (n = 3), the update step size is 0.1 (ε = 0.1), the target half-width is 2 (s = 2.0), the half-range of stereo disparities is 4.0 (a = 4.0), the background noise level is 0.2 (σ = 0.2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

target = Map[Table[Exp[-(i - Floor[K/2] - #)^2/(2 s^2)], {i, K}] &,
  {0, Random[Real, {-a, a}]}]

Map[# + σ Random[] &, #, {2}] &[
  Map[# Random[] &,
    RotateRight[target, {0, Random[Integer, {0, K - 1}]}], {2}]]

This can be broken down into individual steps thus:

1. target is a stereo image of a target obtained by centring a target on one of the sensors and generating a randomly shifted copy (the shift is uniformly distributed in [−a, a]) on the other sensor.

2. RotateRight[..., {0, Random[Integer, {0, K - 1}]}] centres the stereo image of the target at a randomised position on the sensors using circular wraparound.

3. Map[# Random[] &, ..., {2}] multiplies each pixel value by a random number uniformly distributed in [0, 1].

4. Map[# + σ Random[] &, ..., {2}] adds an independent uniformly distributed random number in the range [0, σ] to each pixel value.

Some typical examples of such input images are shown in Figure 36.

Figure 36: Some typical examples of stereo target images.

Train on 2000 stereo images derived as above.

The training history of the loss function is shown in Figure 37.

Figure 37: The training history of the loss function for n = 2. Every 100th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 38.

Figure 38: The training history of the rows of the reconstruction matrix for n = 2. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 39, where it is seen that each code index typically encodes a stereo image of a target at a given position and with a given stereo disparity.

Figure 39: The rows of the reconstruction matrix displayed in stereo image format for n = 2.

The posterior probabilities that each code index is selected as a function of the mean position of the two images (horizontal axis) and stereo disparity (vertical axis) of a test target are shown in Figure 40.

Figure 40: The posterior probabilities that each code index is selected as a function of the mean position of the two images (horizontal axis) and stereo disparity (vertical axis) of a test target for n = 2.

The results shown in Figure 40 show that the code book operates very clearly as a joint encoder, because each code index jointly encodes position and stereo disparity. The disparity direction is resolved into approximately 3 disparities (positive, zero, and negative disparity), whereas the position direction is resolved into approximately 8 positions, giving a total of 24 (= 3 × 8) different possible codes. The measurement of stereo disparity (and position) to this resolution requires only one code index to be observed.

If the same simulation is repeated, but using n = 50, then the results are shown in Figure 41, Figure 42, Figure 43, and Figure 44.

Figure 41: The training history of the loss function for n = 50. Every 100th sample is shown.

Figure 42: The training history of the rows of the reconstruction matrix for n = 50. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

Figure 43: The rows of the reconstruction matrix displayed in stereo image format for n = 50.

Figure 44: The posterior probabilities that each code index is selected as a function of the position (horizontal axis) and stereo disparity (vertical axis) of a test target for n = 50.

The results shown in Figure 44 show that the code book operates very clearly as a factorial encoder, because each code index encodes a linear combination of position and stereo disparity. However there are 2 subsets of code indices, one of which has a negative slope and the other of which has a positive slope (as seen in Figure 44). The intersections between these two subsets may be used to triangulate small patches on the input manifold by the same process that was described in the context of Figure 15. The measurement of stereo disparity requires a minimum of two code indices to be observed, which must belong to oppositely sloping subsets in Figure 44. In practice, many more than two code indices must be observed to virtually guarantee that there is at least one in each of the two subsets.

L. Topographic Map

The purpose of this simulation is to show how some degree of control can be exercised over the properties of each code index. Intuitively, if the code book wants to use a particular code index, but is frustrated in this attempt by the presence of random cross-talk between code indices, and is forced to randomly use one member of a set of code indices instead, then the amount of information that is carried by the code index that is actually (and randomly) selected is thereby reduced. However, if the code book can configure itself so that random cross-talk occurs only between code indices that code for similar inputs, then the information loss can be reduced to a minimum. Conversely, if a particular type of configuration is required in the code book, then it can be encouraged by deliberately introducing the appropriate type of random cross-talk.

In this report random cross-talk is introduced by the M × M leakage matrix L. This is a transition matrix, in which the elements of a given row are the probabilities that the corresponding code index gets randomly converted into each of the M possible code indices.
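As a toy illustration (a hypothetical 2 × 2 example, not one used in the report), the leakage matrix

L = {{0.9, 0.1}, {0.1, 0.9}};

states that each of the two code indices is transmitted faithfully with probability 0.9 and is randomly converted into the other index with probability 0.1; each row sums to unity, as required of a transition matrix.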

Here the problem is to encode a 2-dimensional image of a randomly placed target, and to encourage the codebook to develop a 2-dimensional topology, such that the code indices can be viewed as living in a 2-dimensional space corresponding to the 2-dimensional manifold on which the target image lives. This can be encouraged by arranging the code indices on a 2-dimensional square grid, and then introducing random cross-talk between code indices that are neighbours on the grid. As was shown in [2], this is closely related to the prescription for generating a topographic map [3].

Note that although the 2-dimensional input manifold is continuous, the 2-dimensional grid on which the code indices live is discrete. If only one code index is sampled (i.e. n = 1), then the optimum SVQ is a vector quantiser, which discontinuously maps the continuous input manifold onto a discrete code index. However, if more than one code index is sampled (i.e. n > 1) then this discontinuity is blurred, and when a sufficiently large number of code indices are sampled the discontinuity disappears altogether, and the continuous input manifold is effectively mapped onto a continuous output manifold. This is clearly seen in the limiting case n → ∞, where the output is effectively the frequency distribution of the number of times each code index is sampled, which is a continuous function of the input.

Initialise the parameter values.

K = 6; d = K^2; M0 = 12; M = M0^2; n = 2; ε = 0.1; s = 1.0;
w = Table[0.1 (Random[] - 0.5), {M}, {d}];
b = Table[0.1 (Random[] - 0.5), {M}];
r = Table[0.1 (Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};

These parameter values state that the image is 6 by 6 pixels (K = 6), the total number of pixels is 36 (d = K² = 36), the code book is 12 by 12 entries (M0 = 12), the total number of code book entries is 144 (M = M0² = 144), 2 code indices are sampled for each input vector (n = 2), the update step size is 0.1 (ε = 0.1), the target half-width is 1 (s = 1.0), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [−0.05, +0.05], the partitioning matrix is initialised to a default state in which its effect is switched off.

The leakage matrix is initialised thus:

L0 = Map[Flatten, Flatten[Table[Exp[-((i1 - i2)^2 + (j1 - j2)^2)/(2 σ^2)],
  {i1, M0}, {j1, M0}, {i2, M0}, {j2, M0}], 1]];

L = Transpose[Map[#/Apply[Plus, #] &, Transpose[L0]]];

This can be broken down into individual steps thus:

1. Table[Exp[-((i1 - i2)^2 + (j1 - j2)^2)/(2 σ^2)], {i1, M0}, {j1, M0}, {i2, M0}, {j2, M0}] is a 4-dimensional list of unnormalised leakage matrix elements defining a 2-dimensional Gaussian neighbourhood with half-width σ. This acts as a transformation from a 2-dimensional image to a 2-dimensional image.

2. Flatten[..., 1] combines the i1 and j1 indices into a single index, leaving a 3-dimensional list overall.

3. Map[Flatten, ...] combines the i2 and j2 indices into a single index, leaving a 2-dimensional list overall. This acts as a transformation from a 1-dimensional "flattened" version of the image to a 1-dimensional "flattened" version of the image.

4. Transpose[Map[#/Apply[Plus, #] &, Transpose[L0]]] normalises the leakage matrix elements to ensure that probability is conserved.

Generate each training vector using

target = Table[Exp[-((i - #[[1]])^2 + (j - #[[2]])^2)/(2 s^2)], {i, K}, {j, K}] &[
  Table[Random[Real, {1, K}], {2}]];

This can be broken down into individual steps thus:

1. Table[Random[Real, {1, K}], {2}] generates a list of 2 random numbers in [1, K] which is the location of the target.

2. Table[Exp[-((i - #[[1]])^2 + (j - #[[2]])^2)/(2 s^2)], {i, K}, {j, K}] &[...] generates a K by K image of pixel values of a Gaussian target.

Some typical examples of such input images are shown in Figure 45.

Figure 45: Some typical examples of Gaussian target images.

Train on 100 images derived as above using a leakage matrix defining a 2-dimensional Gaussian neighbourhood with half-width σ = 5.0. Then train on a further 100 images using half-width σ = 2.5.

The training history of the loss function is shown in Figure 46.

Figure 46: The training history of the loss function. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 47.


Figure 47: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page. Every 10th sample is shown.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 48, where it is seen that the code indices are topographically organised, and each code index typically encodes an image of a target at a given position.

Figure 48: The rows of the reconstruction matrix displayed in image format.

The training history of the rows of the reconstruction matrix may be displayed in a much more vivid way. Compute the centroid of each of the rows of the reconstruction matrix (when arranged in image format as in Figure 48), and then draw vectors between the centroids of rows of the reconstruction matrix corresponding to neighbouring code indices. The result of this is shown in Figure 49.
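The centroid computation that underlies this display can be sketched as follows (this is not the report's plotting code; it assumes K and r as defined above, and treats the pixel values of each row as non-negative weights):

Centroid[row_] := Module[{img = Partition[row, K], tot},
  tot = Apply[Plus, Flatten[img]];
  {Sum[i Apply[Plus, img[[i]]], {i, K}],
   Sum[j Apply[Plus, Map[#[[j]] &, img]], {j, K}]}/tot];

centroids = Map[Centroid, r];

Joining the centroids of rows of r that correspond to neighbouring code indices on the M0 × M0 grid then yields maps of the kind shown in Figure 49.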

Figure 49: Each step in the training history of the rows of the reconstruction matrix represented as a sequence of topographic maps, which should be read left to right, first row then second row. Every 20th sample is shown.

The evolution of the topographic map shown in Figure 49 starts from a small crumpled map, and then gradually unfolds to yield the final result. In the top row the leakage matrix elements define a 2-dimensional Gaussian neighbourhood with half-width 5.0, and then the half-width is reduced to 2.5 (i.e. the leakage is reduced) in the bottom row. Thus the first half of the simulation is run with a large amount of leakage in order to encourage the topographic map to develop smooth long-range order. If this is not done, then typically the topographic map gets trapped in a frustrated configuration in which it is folded or twisted over on itself.

The contraction which is observed at the edge of the topographic map can be alleviated by making the half-width of the Gaussian leakage neighbourhood (perpendicular to the edge of the map) in the vicinity of the edge of the map smaller than in the centre of the map. This refinement to the training algorithm is not explored here.

IV. CONCLUSIONS

It has been shown in this report how a stochastic encoder/decoder (specifically, a stochastic vector quantiser (SVQ)) may be used to discover useful ways of encoding data. The body of the report consists of a number of worked examples, each of which is carefully designed to illustrate a particular point, and the appendices give the complete Mathematica source code for implementing the SVQ approach.

The idealised simulations in Section III D (input data live on a circle) and Section III E (input data live on a 2-torus) can be used to understand the results obtained in the simulations in Section III F (input data is target(s) viewed by imaging sensor(s)). This underlines the usefulness of the results that were obtained in [5], where the encoding of circular and toroidal input manifolds was solved analytically using the algebraic manipulator Mathematica [4].

The simulations in Section III K illustrate how these results may be extended to the case of correlated sensors, by examining the case of stereo disparity.

The results presented in this report illustrate a variety of possible behaviours that a 2-layer encoder/decoder network can exhibit. Each behaviour can be interpreted as the discovery by the network of objects (e.g. targets) and correlations (e.g. stereo disparity) in data derived from one or more sensors. This forms the basis of an approach to the fusion of data from multiple sensors.


V. RECOMMENDATIONS

Extend the approach advocated in this report to the case of a multi-layer encoder/decoder network, to allow discovery by the network of more complicated objects and correlations in data. This will move the research towards more realistic data fusion scenarios. For instance, the first stage of encoding might be used to discover the fact that the data consist of a superposition of individual targets, whereas the second stage of encoding might be used to discover that the positions of the targets tend to be correlated with each other, and so on.

VI. ACKNOWLEDGEMENTS

I thank Chris Webber for many useful conversations that we had during the course of this research.

Appendix A: ENCODER/DECODER PRACTICE

In this Appendix the Mathematica routines for implementing encoder/decoders will be developed. This starts in Section A1, where the familiar vector quantiser (VQ) is outlined. In Section A2 this is generalised to a stochastic vector quantiser (SVQ). In Section A6 this is further generalised to handle the cases of limited network connectivity and leakage of probability [8].

1. Deterministic Vector Quantiser (VQ)

A vector quantiser (VQ) is the simplest type of encoder/decoder [1], and it is the foundation on which everything else in this report is based. A brief description of a VQ is given in Section II A.

The VQ encoding algorithm is:

1. Compute the difference vector between the input vector and each code book vector.
2. Compute the length-squared of each difference vector.
3. Compute the minimum of these length-squared differences.
4. Compute the position in the code book of the minimum of these length-squared differences.
5. Return the result of step 4 (the code index).

An implementation of the encoding algorithm is (x = inputvector, c = codebook, return = codeindex):

EncodeVQ[x_, c_] :=
  Position[#, Min[#]] &[Map[# . # &, Map[(x - #) &, c]]][[1, 1]];

This can be broken down into individual steps thus:

1. Map[(x-#)&,c] is the difference vector between the input vector and each code vector.
2. Map[#.#&,...] is the length-squared of each of these difference vectors.
3. Position[#,Min[#]]&[...][[1,1]] is the index of (the first occurrence of) the minimum of these length-squared differences.

Ties for the closest code vector are arbitrarily broken by selecting the first of the closest code vectors that is encountered. A fairer method would be to break ties at random.

The decoding algorithm is:

1. Extract the code vector from the position in the code book indexed by the code index.
2. Return the result of step 1.

An implementation of the decoding algorithm is (y = codeindex, c = codebook, return = reconstructedvector):

DecodeVQ[y_, c_] := c[[y]];
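As an illustrative usage sketch (the code book and input vector below are made up for this example, and are not taken from the report's simulations), a VQ round trip looks like this:

c = {{0., 0.}, {1., 0.}, {0., 1.}};   (* made-up code book of three 2-dimensional code vectors *)
x = {0.9, 0.2};                       (* made-up input vector *)
y = EncodeVQ[x, c];                   (* gives 2, since {1., 0.} is the nearest code vector to x *)
DecodeVQ[y, c]                        (* gives {1., 0.}, the reconstruction of x *)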

A key limitation of a VQ is that the code index is generated deterministically from the input vector, which means that the code book size depends exponentially on the input dimensionality, for a given reconstruction distortion (per dimension): this prevents VQs from being used to encode high-dimensional data. This restriction will be lifted in Section A2, where stochastic vector quantisers are introduced.
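As a rough way to see this scaling (this calculation is not made explicitly in the report): if approximately $R$ bits per input dimension are needed to reach the target distortion per dimension, then a VQ acting on the whole $d$-dimensional input vector needs a code book of size

N \approx 2^{R d},

which grows exponentially with the input dimensionality $d$.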

2. Basic Stochastic Vector Quantiser (SVQ)

If a VQ is used to encode data, then to achieve a given reconstruction distortion the size of the codebook must scale exponentially as the dimensionality of the input space is increased. For high-dimensional input spaces, such as occur in image processing where the number of image pixels is the dimensionality, this exponential dependence is unacceptable.

One possible solution to this scaling problem is simply to split the input space into a number of subspaces, each of which has a low dimensionality, and to encode each such subspace separately. The overall effect is to encode the high-dimensional input vector using more than one code index (one for each subspace).

A more sophisticated approach to solving the scaling problem is to automate the discovery of a suitable set of subspaces, because some choices are more effective than others. For instance, in image processing, a more effective encoder is obtained by placing in the same subspace pixels that are correlated with each other. Because images tend to have strong correlations between neighbouring pixels, the optimum subspaces tend to be subimages of neighbouring pixels.

The minimum requirement for the automated discovery of subspaces is thus an encoder that inputs a high-dimensional vector and outputs more than one code index (i.e. a vector of code indices). From the above discussion, when this type of encoder is optimised, each code index will typically be associated with a subspace of the input space.

A stochastic vector quantiser (SVQ) has the required properties. Thus an SVQ inputs a high-dimensional vector, computes a probability distribution over all of the code indices, and then draws samples from this probability distribution. These samples form a vector of code indices, which is the output of the SVQ.


The derivation in [5] shows how, in the special case where the input space is a 2-torus, each code index can then become associated with a subspace of the input space. If only a single sample is drawn, then an SVQ is more lossy than a VQ with the same codebook size. However, if more than one sample is drawn, then an SVQ can be (and usually is) less lossy than a VQ with the same codebook size, because the total amount of information in the stochastic vector of code indices produced by an SVQ can exceed the information in the deterministic scalar code index produced by a VQ.
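A rough counting argument makes this plausible (this argument is not spelled out in the report): a deterministic code index drawn from a code book of size $m$ can convey at most $\log_2 m$ bits about the input vector, whereas a vector of $n$ code indices drawn from the same code book can convey up to $n \log_2 m$ bits.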

In order to allow the SVQ to behave stochastically, it is necessary for the SVQ encoder to compute an input-vector-dependent probability distribution over all of the code indices. It will be assumed that the SVQ encoder is implemented using a set of sigmoid functions of the input vector, whose weights and biases comprise the encoder's code book, to compute the (unnormalised) probability distribution over all of the code indices. The use of sigmoid functions is not strictly necessary, because any set of non-negative functions of the input vector could in principle be used.

It can be shown analytically [9] that the optimal encoder has probability distributions that depend in a piecewise linear fashion on the input vector. Thus the use of sigmoid functions will lead to suboptimal results, but they are much easier to manipulate than piecewise linear functions.

It will also be assumed that the SVQ decoder is implemented as a linear superposition of the reconstructions obtained from each of the separate code indices. This is not the most general dependence that the reconstruction vector can have on the code indices, so it will lead to suboptimal results in which the reconstruction distortion is larger than it might otherwise have been. However, this type of superposition may be expected to work well in situations where the input vector typically consists of the superposition of a finite number of vectors, such as in image processing, where each image is typically the superposition of the images of a finite number of objects (assuming that occlusion of one object by another may be ignored).

A loss function that measures the average reconstruction error will be used, so that the SVQ encoder and decoder can be optimised. It will turn out that the key property of being able to split into multiple parallel encoder/decoders, each of which encodes a different subspace of the input, emerges automatically when this loss function is minimised. In fact, these emergent properties are the means by which the appropriateness of a loss function is judged, and the average reconstruction error loss function is the simplest one found so far with the right sort of emergent properties.

In Section A3 some miscellaneous routines are introduced, in Section A4 routines for encoding and decoding in an SVQ are given, and in Section A5 the loss function that is used to optimise the SVQ is given.

No gradient descent routines are given here for optimising an SVQ, because it is a special case of the generalised SVQ which will be presented in Section A6.

3. Basic SVQ: Miscellaneous Routines

An implementation of the sigmoid function algorithm, which computes the sigmoid responses of all of the code indices to an input vector, is (x = inputvector, w = weightmatrix, b = biasvector, return = sigmoidresponsevector):

Sigmoid[x_, w_, b_] := 1/(1 + Exp[-w . x - b]);

This is mostly self-explanatory; however, it must be noted that the operation 1/(1 + Exp[...]) acts separately on each component of its vector argument.

An implementation of a sampling algorithm for picking samples at random from a discrete probability distribution is (p = probabilitydistributionvector, n = numberofsamples, return = samplevector):

Sample[p_, n_] :=
  Table[Position[# - Random[], _?Positive][[1, 1]] - 1, {n}] &[FoldList[Plus, 0, p]];

This can be broken down into individual steps thus:

1. FoldList[Plus,0,p] is the cumulative sum of the elements of p.
2. (#-Random[])&[...] is the cumulative sum with a random number (uniformly distributed in the interval [0, 1]) subtracted off.
3. Position[...,_?Positive][[1,1]]-1 is the index of the first positive element in the cumulative sum.
4. Table[...,{n}] repeats this whole process n times.
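As a usage sketch (the probability vector below is made up for this example), the following draws 10 samples from a 3-element distribution; fixing the random seed makes the run repeatable:

SeedRandom[12345];           (* make the stochastic draws repeatable *)
Sample[{0.1, 0.6, 0.3}, 10]  (* a list of 10 indices in the range 1..3, drawn with probabilities 0.1, 0.6 and 0.3 *)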

4. Basic SVQ: Encoding/Decoding

An implementation of the algorithm for computing the vector of posterior probabilities that each possible code index might be drawn next is (x = inputvector, w = weightmatrix, b = biasvector, return = posteriorprobabilityvector):

PosteriorSVQ0[x_, w_, b_] := #/Apply[Plus, #] &[Sigmoid[x, w, b]];

This can be broken down into individual steps thus:

1. Sigmoid[x,w,b] is the vector of sigmoidal responses.
2. #/Apply[Plus,#]&[...] normalises it so that the sum of its elements is unity.

An implementation of the encoding algorithm is (x = inputvector, w = weightmatrix, b = biasvector, n = numberofsamples, return = codeindexvector):

EncodeSVQ0[x_, w_, b_, n_] := Sample[PosteriorSVQ0[x, w, b], n];

This can be broken down into individual steps thus:

1. PosteriorSVQ0[x,w,b] is the vector of probabilities that each code index is generated.
2. Sample[...,n] is n samples drawn from this probability distribution.

An implementation of the decoding algorithm is (y = codeindexvector, r = reconstructionmatrix, return = reconstructedvector):


DecodeSVQ0[y_, r_] := Apply[Plus, r[[y]]]/Length[y];

This can be broken down into individual steps thus:

1. r[[y]] is a matrix formed from the rows of the reconstruction matrix indexed by the elements of the code index vector.
2. Apply[Plus,...] is the sum of the rows of this matrix, which gives a linear superposition of contributions to the reconstruction.
3. (...)/Length[y] normalises this sum by dividing by the number of code indices.
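As a usage sketch (the weight matrix, bias vector and reconstruction matrix below are made up for this example, and are not taken from the report's simulations), a basic SVQ round trip with a 2-dimensional input, 3 code indices and n = 4 samples looks like this:

w = {{1., 0.}, {0., 1.}, {-1., -1.}};   (* made-up 3 x 2 weight matrix *)
b = {0., 0., 0.};                       (* made-up bias vector *)
r = {{1., 0.}, {0., 1.}, {0.5, 0.5}};   (* made-up 3 x 2 reconstruction matrix *)
x = {0.8, 0.3};                         (* made-up input vector *)
p = PosteriorSVQ0[x, w, b];             (* probability of selecting each of the 3 code indices *)
y = EncodeSVQ0[x, w, b, 4];             (* a vector of 4 stochastically drawn code indices *)
DecodeSVQ0[y, r]                        (* the reconstruction: the average of the corresponding rows of r *)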

5. Basic SVQ: Loss Function

The loss function (for a given input vector) may be computed using the following algorithm:

1. Compute the vector of probabilities of selecting each code index.
2. Compute the contribution to the D1 part of the loss function (see below).
3. Compute the contribution to the D2 part of the loss function (see below).
4. Compute the sum of these contributions.
5. Return the result of step 4 (the loss function).

An implementation of the loss function algorithm is (x = inputvector, w = weightmatrix, b = biasvector, r = reconstructionmatrix, n = numberofsamples, return = lossfunction):

LossSVQ0[x_, w_, b_, r_, n_] :=
  Module[{p, D1, D2},
    p = PosteriorSVQ0[x, w, b];
    D1 = (2/n) Apply[Plus, p Map[# . # &, Map[x - # &, r]]];
    D2 = (2 (n - 1)/n) (# . # &[x - Apply[Plus, p r]]);
    D1 + D2
  ];

The expression for D1 can be broken down into individual steps thus:

1. Map[x-#&,r] is the difference between the input vector and each row of the reconstruction matrix.
2. Map[#.#&,...] is the length-squared of each of these differences.
3. p(...) weights each of these length-squared differences by a posterior probability.
4. Apply[Plus,...] sums up the elements of this vector, then (2/n)(...) weights the result by 2/n.

The expression for D2 can be broken down into individual steps thus:

1. p r weights each row of the reconstruction matrix by a posterior probability.
2. Apply[Plus,...] is a vector which is the sum of the rows of this matrix.
3. x-(...) is the difference between the input vector and this sum.
4. #.#&[...] is the length-squared of this difference.
5. (2(n-1)/n)(...) weights the result by 2(n-1)/n.
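For reference, the quantity computed by LossSVQ0 can be transcribed into standard notation as follows, where $\Pr(y|x)$ denotes the normalised posterior probability returned by PosteriorSVQ0, $r(y)$ is the y-th row of the reconstruction matrix, and $n$ is the number of samples:

D_1 = \frac{2}{n} \sum_{y} \Pr(y|x)\, \| x - r(y) \|^2 , \qquad
D_2 = \frac{2(n-1)}{n} \, \Big\| x - \sum_{y} \Pr(y|x)\, r(y) \Big\|^2 ,

and the returned loss is $D_1 + D_2$.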

6. Full Stochastic Vector Quantiser (SVQ)

An SVQ can readily be further generalised to handle the cases of limited network connectivity and leakage of probability [8].

Limited network connectivity ensures that the SVQ computations will scale sensibly as the size of the network is increased. The main non-local computation that appears in an SVQ is the normalisation of the posterior probability that each possible code index might be drawn next (see Section A4). This can be avoided by partitioning the code indices into overlapping subsets, and normalising within each subset only. A globally normalised posterior probability can then be constructed by summing the posterior probabilities derived from each of these subsets [8].

Leakage of probability allows topological constraints to be imposed on the SVQ codebook, so that topographic mappings can be trained [2, 3]. The basic trick here is to define a code book topology by splitting the code indices into overlapping subsets (usually different from the subsets used above), and to allow "cross-talk" between the code indices in each of these subsets, such that mixing occurs between the posterior probabilities that each possible code index might be drawn next [2]. In order for the SVQ to produce low loss codes under these adverse conditions it is essential that code indices that are coupled by cross-talk code for similar properties of the input vector, because then the damaging effect of the cross-talk is reduced. This was referred to as the "robust hidden layer principle" in [2]. One possible way of constructing such a leakage matrix is sketched below.

In Section A7 routines for encoding and decoding in an SVQ are given, in Section A8 the loss function that is used to optimise the SVQ is given, in Section A9 the derivatives of this loss function with respect to the underlying parameters are given, and in Section A10 an algorithm for using these derivatives to update the parameters is given. In addition, in Section A11 a useful routine for implementing "weight decay" is given.
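As a concrete sketch of how the partitioning matrix A and the leakage matrix L might be set up (this construction, and the function name GaussianLeakage, are illustrative assumptions rather than code taken from the report), the simplest partition places every code index in a single subset, and a 1-dimensional topographic neighbourhood of width sigma can be encoded in a leakage matrix whose columns are normalised so that the leaked posterior probability still sums to unity:

m = 10;                                  (* number of code indices *)
A = {Table[1, {m}]};                     (* one subset containing every code index *)
GaussianLeakage[m_, sigma_] :=
  Transpose[
    Map[#/Apply[Plus, #] &,
      Table[N[Exp[-(y - y2)^2/(2 sigma^2)]], {y2, m}, {y, m}]]];
L = GaussianLeakage[m, 5.0];             (* cross-talk between nearby code indices *)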

7. Full SVQ: Encoding/Decoding

An implementation of the algorithm for computing the vector of posterior probabilities that each possible code index might be drawn next is (x = inputvector, w = weightmatrix, b = biasvector, A = partitioningmatrix, L = leakagematrix, return = posteriorprobabilityvector):

PosteriorSVQ[x_, w_, b_, A_, L_] :=
  (L . ((# ((1/(A . #)) . A)) &[Sigmoid[x, w, b]]))/Length[A];

This can be broken down into individual steps thus:

1. Sigmoid[x,w,b] is the vector of sigmoidal responses.
2. (# ((1/(A.#)).A))&[...] is a vector which is the sum of vectors formed by weighting the sigmoidal responses by each row of the partitioning matrix (and then normalising their sum to unity).
3. L.(...) is the leaked version of this vector.


4. (...)/Length[A] normalises this to produce a posterior probability vector whose sum of elements is unity.
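As a quick sanity check (not part of the report's code), using the made-up x, w and b from the earlier basic SVQ sketch, a single all-ones partition row together with an identity leakage matrix makes PosteriorSVQ reduce to PosteriorSVQ0:

Chop[PosteriorSVQ[x, w, b, {Table[1, {Length[b]}]}, IdentityMatrix[Length[b]]] -
     PosteriorSVQ0[x, w, b]]             (* a vector of zeros, confirming the reduction *)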

An implementation of the encoding algorithm is (x = inputvector, w = weightmatrix, b = biasvector, A = partitioningmatrix, L = leakagematrix, n = numberofsamples, return = codeindexvector):

EncodeSVQ[x_, w_, b_, A_, L_, n_] := Sample[PosteriorSVQ[x, w, b, A, L], n];

This can be broken down into individual steps thus:

1. PosteriorSVQ[x,w,b,A,L] is the vector of probabilities that each code index is generated.
2. Sample[...,n] is n samples drawn from this probability distribution.

This is the same as the basic SVQ case, except that PosteriorSVQ is used rather than PosteriorSVQ0.

An implementation of the decoding algorithm is (y = codeindexvector, r = reconstructionmatrix, return = reconstructedvector):

DecodeSVQ[y_, r_] := DecodeSVQ0[y, r];

8. Full SVQ: Loss Function

An implementation of the loss function algorithm is (x = inputvector, x1 = targetvector (the vector that the decoder attempts to reconstruct), w = weightmatrix, b = biasvector, A = partitioningmatrix, L = leakagematrix, r = reconstructionmatrix, n = numberofsamples, return = lossfunction):

LossSVQ[x_, x1_, w_, b_, A_, L_, r_, n_] :=
  Module[{p, D1, D2},
    p = PosteriorSVQ[x, w, b, A, L];
    D1 = (2/n) Apply[Plus, p Map[# . # &, Map[x1 - # &, r]]];
    D2 = (2 (n - 1)/n) (# . # &[x1 - Apply[Plus, p r]]);
    D1 + D2
  ];

This is the same as the basic SVQ case, except that PosteriorSVQ is used rather than PosteriorSVQ0, and the reconstruction error is measured relative to the target vector x1.

9. Full SVQ: Loss Function Derivatives

The expressions for the loss function derivatives are derived in Appendix B.

An implementation of the loss function derivatives algorithm is (x = inputvector, x1 = targetvector, w = weightmatrix, b = biasvector, A = partitioningmatrix, L = leakagematrix, r = reconstructionmatrix, n = numberofsamples, return = {lossfunction, derivative w.r.t. w, derivative w.r.t. b, derivative w.r.t. r}):

DLossSVQ[x_, x1_, w_, b_, A_, L_, r_, n_] :=
  Module[{M, dxr, e, q, Z, P, p, LT, LTe, Lp, LTr, PLTr, PT, PTPLTr,
      Lpr, PLTe, pLTe, PTPLTe, dxLpr, d1, d2, dd1dq, dd2dq, dd1dr, dd2dr,
      dd1db, dd2db, dd1dw, dd2dw, c1, c2, D12, dD12dw, dD12db, dD12dr},
    M = Length[A];
    dxr = Map[x1 - # &, r];
    e = Map[# . # &, dxr];
    q = Sigmoid[x, w, b];
    Z = A . q;
    P = Map[q # &, A]/Z;
    p = Apply[Plus, P];
    LT = Transpose[L];
    LTe = LT . e;
    Lp = L . p;
    LTr = LT . r;
    PLTr = P . LTr;
    PT = Transpose[P];
    PTPLTr = PT . PLTr;
    Lpr = Lp . r;
    PLTe = P . LTe;
    pLTe = p . LTe;
    PTPLTe = PT . PLTe;
    dxLpr = x1 - Lpr/M;
    d1 = pLTe/M;
    d2 = # . # &[dxLpr];
    dd1dq = (LTe p - PTPLTe)/(M q);
    dd2dq = -(1/(M q)) (2 (LTr p - PTPLTr) . dxLpr);
    dd1dr = -2 dxr Lp/M;
    dd2dr = -(1/M) (2 Outer[Times, Lp, dxLpr]);
    dd1db = q (1 - q) dd1dq;
    dd2db = q (1 - q) dd2dq;
    dd1dw = Outer[Times, dd1db, x];
    dd2dw = Outer[Times, dd2db, x];
    c1 = 2/n;
    c2 = 2 (n - 1)/n;
    D12 = {c1, c2} . {d1, d2};
    dD12dw = {c1, c2} . {dd1dw, dd2dw};
    dD12db = {c1, c2} . {dd1db, dd2db};
    dD12dr = {c1, c2} . {dd1dr, dd2dr};
    {D12, dD12dw, dD12db, dD12dr}
  ];

This has already been broken down into individual steps which are detailed in Appendix B, so no further explanation is necessary here.
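A convenient way to test such hand-derived gradients (this check is a standard technique and is not part of the report; the function name CheckGradB and the step size h are illustrative) is to compare one component of the analytic bias-vector derivative against a central finite difference of LossSVQ:

CheckGradB[x_, x1_, w_, b_, A_, L_, r_, n_, k_, h_] :=
  Module[{analytic, numeric, db},
    analytic = DLossSVQ[x, x1, w, b, A, L, r, n][[3, k]];     (* k-th component of the analytic derivative w.r.t. b *)
    db = Table[If[i == k, h, 0.], {i, Length[b]}];            (* perturb only the k-th bias element *)
    numeric = (LossSVQ[x, x1, w, b + db, A, L, r, n] -
               LossSVQ[x, x1, w, b - db, A, L, r, n])/(2 h);  (* central finite difference *)
    {analytic, numeric}                                       (* the two values should agree closely for small h *)
  ];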

10. Full SVQ: Parameter Update

An implementation of a gradient descent algorithm for minimising the loss function is (x = inputvector, x1 = targetvector, w = weightmatrix, b = biasvector, A = partitioningmatrix, L = leakagematrix, r = reconstructionmatrix, n = numberofsamples, ε = updatestepsize, return = {lossfunction, neww, newb, newr}):


UpdateSVQ[x_, x1_, w_, b_, A_, L_, r_, n_, ε_] :=
  Module[{dD, dim, dwmax, dbmax, drmax},
    dD = DLossSVQ[x, x1, w, b, A, L, r, n];
    dim = Length[x];
    dwmax = Sqrt[Max[Map[# . # &, dD[[2]]]]]/Sqrt[dim];
    dbmax = Max[Abs[dD[[3]]]];
    drmax = Sqrt[Max[Map[# . # &, dD[[4]]]]]/Sqrt[dim];
    {dD[[1]], w, b, r} - ε {0, dD[[2]]/dwmax, dD[[3]]/dbmax, dD[[4]]/drmax}
  ];

The expression for dD yields a list comprising the loss function, its derivative w.r.t. the weight matrix, its derivative w.r.t. the bias vector, and its derivative w.r.t. the reconstruction matrix.

The expression for dwmax can be broken down into individual steps thus:

1. dD[[2]] is the derivative of the loss function w.r.t. the weight matrix.
2. Map[#.#&,...] is the length-squared of each row vector of this derivative matrix.
3. Sqrt[Max[...]] is the maximum of these lengths.
4. (...)/Sqrt[dim] is a factor which would scale the derivative w.r.t. the weight matrix so that the maximum of the length-squared of each row vector is unity.

The expression for dbmax yields the maximum absolute element of the derivative w.r.t. the bias vector.

The expression for drmax is analogous to the expression for dwmax.

The expression -ε{0, dD[[2]]/dwmax, dD[[3]]/dbmax, dD[[4]]/drmax} yields the update for the weight matrix, the bias vector, and the reconstruction matrix, such that the maximum distance by which each row of the weight matrix and the reconstruction matrix is moved is ε, and the maximum absolute value by which each element of the bias vector is adjusted is ε. The first element is the update to the loss function, which is not computed here, so it is set to zero.

This prescription, whereby the updates are derived from scaled versions of the derivatives, means that the parameters are not updated according to a pure gradient descent algorithm, because the update rates of each type of parameter (i.e. weight matrix, bias vector, and reconstruction matrix) are separately scaled.

It is not claimed that this prescription is optimal. For instance, it would be sensible to use a relatively smaller value of ε for the update of the reconstruction matrix, because its optimal value depends on the weight matrix and the bias vector (and the training data) rather than the other way around.

11. Full SVQ: Weight Decay

Sometimes it turns out to be useful to introduce a penalty on large weight matrix elements, because this encourages the formation of optimal solutions that have a small number of non-zero entries in the weight matrix. This is effectively a modification of the loss function in which an extra term is included that assigns a cost to the weight matrix element sizes.

Weight decay could merely be used to encourage convergence to the optimal solution when it is known in advance that it must have a small number of non-zero entries in the weight matrix, and the decay could be switched off as the solution converged towards one that minimised the loss function.

Alternatively, weight decay could be used more strongly to enforce a prior bias towards solutions that have few non-zero entries in the weight matrix, even if these are not the ones that actually minimise the loss function.

An implementation of a weight decay algorithm is (w = weightmatrix, λ = weightdecayparameter, return = neww):

Decay[w_, λ_] := MapThread[If[Sign[#1] == Sign[#2], #1, 0] &, {w - λ #, #} &[Sign[w]], 2];

This can be broken down into individual steps thus:

1. Sign[w] gives the signs of the weight matrix elements.
2. {w-λ#,#}&[...] is the decayed weights and their original signs.
3. MapThread[If[Sign[#1]==Sign[#2],#1,0]&,(...),2] clips each decayed weight to zero if it has changed sign.

Weight decay is most conveniently done immediately after the parameters have been updated, as in the sketch below.
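A minimal training-loop sketch (not taken from the report; the list of training vectors xs, the parameters w, b, r, A, L, the sample count n, the step size eps and the decay parameter lambda are all assumed to have been initialised already, and each input vector is used as its own reconstruction target):

Do[
  Module[{x = xs[[Random[Integer, {1, Length[xs]}]]], new},  (* pick a training vector at random *)
    new = UpdateSVQ[x, x, w, b, A, L, r, n, eps];            (* one scaled gradient step *)
    {w, b, r} = Rest[new];                                   (* keep the updated parameters, discard the loss *)
    w = Decay[w, lambda]                                     (* weight decay immediately after the update *)
  ],
  {1000}
];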

Appendix B: DERIVATIVES OF THE SVQ LOSS FUNCTION

In this appendix the expressions for the various derivatives of the SVQ loss function are derived using Mathematica. These derivatives were originally published in [8].

1. Basic SVQ: Loss Function

In order to establish the steps to use in the derivation, use Mathematica to differentiate the basic SVQ loss function, whilst avoiding the additional complications that arise in the full SVQ loss function.

2. Basic SVQ: Define the Basic Quantities

Clear out any preexisting definitions.

Clear[Pr, D1, D2];

Define the posterior probability.

Pr[y_] := q[y]/Sig[q[y1], y1];

Define the D1 and D2 parts of the loss function (omitting the 2/n and 2(n-1)/n factors, respectively).

D1 := Sig[Pr[y] Sig[(x[i] - r[y, i])^2, i], y];
D2 := Sig[(x[i] - Sig[r[y, i] Pr[y], y])^2, i];


Diff is a custom differentiation operator and Sig is a custom summation operator, which are used to avoid the implicit properties that Mathematica assigns to its own corresponding D and Sum operators.

3. Basic SVQ: Properties of Summation and Differentiation Operators

Clear out any preexisting definitions for the custom differentiation (Diff) and summation (Sig) operations.

Clear[Diff, Sig];

Properties needed for derivatives with respect to q. These properties can be added interactively during a Mathematica session in order to progressively simplify the various terms that arise when differentiating the loss function. The Del operator is a custom Kronecker delta, which is used in intermediate steps of the simplification.

Diff[Sig[x_, y_], z_] := Sig[Diff[x, z], y];
Diff[x_/y_, z_] := (1/y) Diff[x, z] - (x/y^2) Diff[y, z];
Diff[x_ y_, z_] := y Diff[x, z] + x Diff[y, z];
Diff[q[y_], q[z_]] := Del[y, z];
Sig[Del[x_, y_], x_] := 1;
Diff[x_^2, y_] := 2 x Diff[x, y];
Diff[-x_ + y_, z_] := -Diff[x, z] + Diff[y, z];
Diff[r[x__], q[y_]] := 0;
Diff[x[y_], q[z_]] := 0;
Sig[0, x_] := 0;
Sig[-x_ + y_, z_] := -Sig[x, z] + Sig[y, z];
Sig[Del[x_, y_] z1_, x_] := z1 /. x → y;
Sig[x_/y_, z_] := (1/y) Sig[x, z] /; FreeQ[y, z];

Properties additionally needed for derivatives with respect to r.

Diff[q[x_], r[y__]] := 0;
Diff[r[w_, x_], r[y_, z_]] := Del[w, y] Del[x, z];
Diff[x[y_], r[z__]] := 0;

4. Basic SVQ: Derivatives

The derivative of D1 with respect to q is obtained thus:

Diff[D1, q[ψ]]

-Sig[q[y] Sig[(-r[y, i] + x[i])^2, i], y]/Sig[q[y1], y1]^2 + Sig[(-r[ψ, i] + x[i])^2, i]/Sig[q[y1], y1]

The derivative of D2 with respect to q is obtained thus:

Diff[D2, q[ψ]]

Sig[2 (-r[ψ, i]/Sig[q[y1], y1] + Sig[q[y] r[y, i], y]/Sig[q[y1], y1]^2) (-Sig[q[y] r[y, i], y]/Sig[q[y1], y1] + x[i]), i]

The derivative of D1 with respect to r is obtained thus:

Diff[D1, r[ψ, j]]

-2 q[ψ] (-r[ψ, j] + x[j])/Sig[q[y1], y1]

The derivative of D2 with respect to r is obtained thus:

Diff[D2, r[ψ, j]]

-2 q[ψ] (-Sig[q[y] r[y, j], y]/Sig[q[y1], y1] + x[j])/Sig[q[y1], y1]

These derivatives could be simplified further, but they have been computed solely to demonstrate how Mathematica can be applied to differentiating the basic SVQ loss function, so it is appropriate to pass on without further ado to the more general case of the full SVQ loss function.

5. Full SVQ: Loss Function

Use Mathematica to differentiate the full SVQ loss function.

6. Full SVQ: Define the Basic Quantities

Clear out any preexisting definitions.

Clear[Pr, D1, D2];

Define the posterior probability.

Pr[y_] := (1/M) Sig[L[y, y2] q[y2] Sig[(1/Sig[A[k, y1] q[y1], y1]) A[k, y2], k], y2];

Define the D1 part of the loss function (omitting the 2/n factor).

D1 := Sig[Pr[y] Sig[(x[i] - r[y, i])^2, i], y];

Define the D2 part of the loss function (omitting the 2(n-1)/n factor).

D2 := Sig[(x[i] - Sig[r[y, i] Pr[y], y])^2, i];

7. Full SVQ: Properties of Summation and Differentiation Operators

Clear out any preexisting definitions for the custom differentiation (Diff) and summation (Sig) operations.

Clear[Diff, Sig];

Properties needed for derivatives with respect to q. These properties can be added interactively during a Mathematica session in order to progressively simplify the various terms that arise when differentiating the loss function. These include the properties defined for the basic SVQ, plus the following new properties.

Diff[M, q[x_]] := 0;
Diff[L[x__], q[y_]] := 0;
Diff[A[x__], q[y_]] := 0;
Sig[w_ (x_ + y_), z_] := Sig[w x, z] + Sig[w y, z];
Sig[-x_, y_] := -Sig[x, y];

Properties additionally needed for derivatives with respect to r. These include the properties defined for the basic SVQ, plus the following new properties.

Diff[M, r[x__]] := 0;
Diff[r[w_, x_], r[y_, z_]] := Del[w, y] Del[x, z];
Diff[L[x__], r[y__]] := 0;
Diff[q[x_], r[y__]] := 0;
Diff[A[x__], r[y__]] := 0;
Diff[x[y_], r[z__]] := 0;


8. Full SVQ: Simplification Rules

If the derivatives of D1 and D2 are evaluated using the properties defined above, then the resulting expressions are still quite large, and have many common subexpressions. It is therefore useful to define a number of simplification rules to rewrite the results in a more compact notation.

r1 = Sig[A[x_, y_] q[y_], y_] → Z[x];
r2 = A[x_, y_]/Z[x_] → P[x, y]/q[y];
r3 = A[x_, y_] A[x_, t_]/Z[x_]^2 → (P[x, y]/q[y]) (P[x, t]/q[t]);
r4 = Sig[P[x_, y_], x_] → p[y];
r5 = Sig[P[x_, y_] P[x_, z_], x_] → PTP[z, y];
r6 = Sig[(-r[y_, z_] + x[z_])^2, z_] → e[y];
r7 = Sig[L[y_, x_] PTP[z_, x_], x_] → PTPLT[z, y];
r8 = Sig[e[y_] PTPLT[x_, y_], y_] → PTPLTe[x];
r9 = Sig[w_ x_ y_, z_] :> y Sig[w x, z] /; FreeQ[y, z];
r10 = Sig[e[x_] L[x_, y_], x_] → LTe[y];
r11 = Sig[w_ x_ y_, z_] :> x Sig[w y, z] /; FreeQ[x, z];
r12 = Sig[L[y_, x_] r[y_, z_], y_] → LTr[x, z];
r13 = Sig[L[y_, z_] p[z_], z_] → Lp[y];
r14 = Sig[Lp[y_] r[y_, x_], y_] → Lpr[x];
r15 = Sig[r[y_, i_] Sig[L[y_, y2_] PTP[z_, y2_], y2_], y_] → Sig[PTP[z, y2] Sig[L[y, y2] r[y, i], y], y2];
r16 = Sig[LTr[y_, x_] PTP[z_, y_], y_] → PTPLTr[z, x];

9. Full SVQ: Derivatives

The derivative of D1 with respect to q is obtained thus:

Diff[D1, q[ψ]] //. {r1, r2, r3, r4, r5, r6, r7, r8, r9, r10} // Expand

LTe[ψ] p[ψ]/(M q[ψ]) - PTPLTe[ψ]/(M q[ψ])

The derivative of D2 with respect to q is obtained thus:

Diff[D2, q[ψ]] //. {r1, r2, r3, r4, r5, r11, r12, r13, r14, r15, r16} // Expand

2 p[ψ] Sig[Lpr[i] LTr[ψ, i], i]/(M^2 q[ψ]) - 2 Sig[Lpr[i] PTPLTr[ψ, i], i]/(M^2 q[ψ]) - 2 p[ψ] Sig[LTr[ψ, i] x[i], i]/(M q[ψ]) + 2 Sig[PTPLTr[ψ, i] x[i], i]/(M q[ψ])

The derivative of D1 with respect to r is obtained thus:

Diff[D1, r[ψ, j]] //. {r1, r2, r4, r13}

-2 Lp[ψ] (-r[ψ, j] + x[j])/M

The derivative of D2 with respect to r is obtained thus:

Diff[D2, r[ψ, j]] //. {r1, r2, r4, r13, r14}

-2 Lp[ψ] (-Lpr[j]/M + x[j])/M

These are the derivatives that are used in Section A9.

[1] Y. Linde, A. Buzo and R. M. Gray, IEEE Trans. Commun., 28, 84, (1980).
[2] S. P. Luttrell, IEEE Trans. Neural Netw., 1, 229, (1990).
[3] T. Kohonen, Self-Organising Maps (Springer-Verlag, Berlin, 1997).
[4] S. Wolfram, The Mathematica Book (Cambridge University Press, Cambridge, 1999).
[5] S. P. Luttrell, in Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems (Perspectives in Neural Computing), edited by A. J. Sharkey (Springer-Verlag, London, 1999), p. 235.
[6] A. Hyvärinen, Neural Comp. Surveys, 2, 94, (1999).
[7] G. E. Hinton and Z. Ghahramani, Philos. Trans. Roy. Soc. B, 352, 1177, (1997).
[8] S. P. Luttrell, in Mathematics of Neural Networks: Models, Algorithms and Applications, edited by S. W. Ellacott, J. C. Mason and I. J. Anderson (Kluwer, Boston, 1997), p. 240.
[9] S. P. Luttrell, in Proc. Int. Conf. on Artificial Neural Networks, edited by <Last> (???, ???, 1999), p. 198.