Technische Universität Berlin
FB Physik, Institut für Theoretische Physik
FB Informatik, FG Neuronale Informationsverarbeitung

Statistical Physics of Clustering Algorithms

Diplomarbeit
Thore Graepel
Matr.-Nr.: 171822
April 1998

Aufgabensteller: Prof. Dr. Eckehard Schöll
Zweitgutachter: Prof. Dr. Klaus Obermayer


Acknowledgments

It is my pleasure to thank Prof. Klaus Obermayer for excellent supervision with numerous discussions and fruitful suggestions. I would also like to thank Matthias Burger, who was my collaborator in the work on which Chapter 4 and Chapter 5 are based. Both Matthias Burger and Ralf Herbrich proofread the manuscript and provided moral support. I am also indebted to Prof. Eckehard Schöll for officially supervising a thesis which lies at the interdisciplinary boundary between statistical physics, statistics, and neuroinformatics.

This work was partly funded by the Technical University of Berlin via the Forschungsinitiativprojekt FIP 13/41. I would also like to thank the Studienstiftung des Deutschen Volkes for the support of my studies.

Finally, many thanks go to my parents, who are the best "sponsors" in every respect. Without them there would be no "me" and this thesis would not have come into existence in the first place.


Contents

1 Introduction
   1.1 Background
      1.1.1 Motivation
      1.1.2 Machine Learning
      1.1.3 Supervised Learning
      1.1.4 Reinforcement Learning
      1.1.5 Unsupervised Learning
   1.2 What is Clustering?
   1.3 Applications of Clustering
   1.4 Methods of Clustering
      1.4.1 Non-Parametric Approaches
      1.4.2 Generative Models
      1.4.3 Reconstructive Methods
   1.5 Overview of the Thesis
      1.5.1 Primary Objective
      1.5.2 Structure of the Thesis

2 The Probabilistic Autoencoder Framework
   2.1 Probabilistic Autoencoders
      2.1.1 Unsupervised Learning and Autoencoders
      2.1.2 The General Folded Markov Chain
   2.2 Derivation of the Cost Functions
      2.2.1 The Two-Stage Folded Markov Chain
      2.2.2 Cost Function for TMP
      2.2.3 The Squared Euclidean Distance Distortion
      2.2.4 Cost Function for TVQ

3 Optimization
   3.1 Optimization Methods
      3.1.1 General Remarks
      3.1.2 Stochastic Gradient Descent
      3.1.3 Simulated Annealing
   3.2 Deterministic Annealing and EM
      3.2.1 Deterministic Annealing
      3.2.2 EM Algorithm

4 Soft Topographic Vector Quantization
   4.1 Derivation of the STVQ-Algorithm
      4.1.1 Stationarity Conditions
      4.1.2 Deterministic Annealing and the EM Algorithm
      4.1.3 Derivatives of the STVQ-Algorithm
   4.2 STVQ for Noisy Source Channel Coding
      4.2.1 Transmission via a Noisy Channel
      4.2.2 Results

5 Phase Transitions in STVQ
   5.1 Initial Phase Transition
   5.2 Automatic Selection of Feature Dimensions
      5.2.1 Phase Transition in the Discrete Case
      5.2.2 Continuous Gaussian Case
   5.3 Numerical Results
      5.3.1 Toy Problem
      5.3.2 Annealing of a Two-Dimensional Array of Cluster Centers
      5.3.3 Automatic Selection of Feature Dimensions for a Chain of Clusters

6 Kernel-Based Soft Topographic Mapping
   6.1 Clustering in High Dimensional Feature Space
   6.2 The Kernel Trick and Mercer's Condition
      6.2.1 The Kernel Trick
      6.2.2 Mercer's Condition
      6.2.3 Admissible Kernel Functions
   6.3 Derivation of Kernel-Based Soft Topographic Mapping
      6.3.1 Topographic Clustering in Feature Space
      6.3.2 Application of Kernel Trick
      6.3.3 EM and Deterministic Annealing
      6.3.4 Critical Temperature of the First Phase Transition
   6.4 Numerical Simulations using the RBF Kernel
      6.4.1 Effect of the RBF Kernel
      6.4.2 Simulation Results with Handwritten Digit Data
      6.4.3 Conclusion on STMK

7 Soft Topographic Mapping for Proximity Data
   7.1 Topographic Clustering on Pairwise Dissimilarity Data
   7.2 Derivation of Soft Topographic Mapping for Proximity Data
      7.2.1 Mean-Field Approximation
      7.2.2 EM and Deterministic Annealing
      7.2.3 SOM Approximation for STMP
   7.3 Critical Temperature of the First Phase Transition
   7.4 Numerical Simulations
      7.4.1 Toy Example: Noisy Spiral
      7.4.2 Topographic Map of the Cat's Cerebral Cortex
      7.4.3 Conclusion on STMP

8 Conclusion
   8.1 Conclusion
   8.2 Future Work

9 Appendix
   9.1 Proof of the Fixed-Point Property
   9.2 Derivation of the Symmetry Properties of the Assignment Correlations
   9.3 Evaluation of the Assignment Correlation for Gaussian Neighborhood Functions
   9.4 Derivation of Mean-Field Equations

Bibliography


Chapter 1
Introduction

1.1 Background

1.1.1 Motivation

The advent of the Information Age has reshaped the world, bringing huge amounts of data to our fingertips. Computers are capable of performing massive calculations in a few seconds, storage capacity increases at an immense rate, and networks of computers, in particular the Internet, make it possible to access data from all over the world. These profound changes make it necessary to develop new tools for dealing with the avalanche of available data. After all, the data does not comprise a value in itself: depending on the user's demand, data has to be located, retrieved, processed, visualized, and, finally, understood in order to be useful. Due to the sheer amount of data and its complexity, we need intelligent methods to extract meaningful information from it.

1.1.2 Machine Learning

One interesting approach in the field of data analysis and processing is the idea of machine learning. According to Valiant [125], "a program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming". In view of the abundance of available data described above, the machine learning paradigm is particularly appealing because it allows for a relatively automatic acquisition of knowledge by computers, which could thus ideally assist humans in exploiting the full potential of the available data. All that is necessary is a sufficient number of training examples from which the relevant task can be learned. In particular, neural network approaches, which employ computational principles from nature, have been shown to yield good generalization, i.e. the ability to abstract from given examples. Researchers have identified three main paradigms of learning that correspond to different learning situations: supervised learning, reinforcement learning, and unsupervised learning (see [50] for a textbook account).

1.1.3 Supervised Learning

Supervised learning [7] is concerned with the solution of classification and regression tasks, in which the training examples are given as pairs of input values and target values. The task of supervised learning is to learn from the examples the general relation between input and target so as to be able to predict the target value for a given new input value. An interesting and useful example of classification is the recognition of handwritten digits from gray-value images obtained from handwritten addresses on mail envelopes [22].


Regression problems appear frequently in the empirical sciences, where measurements or observations sample the input space and the researcher is interested in a continuous relation between input and target space. Methods range from traditional statistics approaches [85] over neural network architectures [47] to support vector machines [127].

1.1.4 Reinforcement Learning

Reinforcement learning [4] is concerned with a different scenario, in which the learner takes actions and receives rewards from the environment. Effectively, the learner receives only feedback for its actions but is not provided with the correct answer or target value for a given situation. The task then is to select actions according to a strategy in such a way as to maximize the total amount of payoff. This paradigm is thought to be more suitable for modeling animal learning, because in nature there is no explicit teacher either. However, reinforcement learning has been successfully applied to problems such as dynamic resource allocation [116] and game playing [118] as well. The most popular method is temporal difference (TD) learning, which was introduced by Sutton [117].

1.1.5 Unsupervised Learning

The third paradigm, the one this work is concerned with, is called unsupervised learning. The learner is provided with samples from a distribution of data, but receives neither target values nor reinforcements. The task is to find a suitable representation of the underlying distribution of the data. Different objectives have been formulated in order to make this task concrete. They include redundancy reduction [95], information maximization [6], minimum cross entropy [87], and minimum reconstruction error [103], some of which will be discussed later in Section 1.4. Unsupervised learning has various applications such as preprocessing for supervised learning [20], data compression [104], density estimation [68], source separation [6], and data visualization [10]. One major approach to unsupervised learning is data clustering, which is the focus of this thesis.

1.2 What is Clustering?

Clustering methods [99] aim at partitioning a set X of D data items x_i into N groups C_r such that data items that belong to the same group are more alike than data items that belong to different groups. These groups are called clusters, and their number N may be preassigned by the analyst or can be a parameter to be determined by the algorithm. The result of the algorithm is thus a mapping X -> C that assigns each data item x_i to a cluster C_r. One interesting way of looking at clustering is to consider it as classification without labels: the clusters represent different classes, but the class labels are treated as missing data.

From the formulation of the problem it is clear that it is crucial for clustering to define an appropriate distance or dissimilarity measure d_ij on the data items. In the most popular case, the data items are represented by Euclidean feature vectors x_i in R^d, on which the squared Euclidean distance d_ij = ||x_i - x_j||^2 serves as a natural distance measure. In this case there is a close connection to vector quantization (VQ) [74], because for each cluster C_r a cluster center or code vector w_r in R^d can be found, which serves as a representative in data space of all the data items assigned to cluster C_r. These vectors thus quantize the continuous vector space (see Chapter 4) and provide a condensed representation of the data.
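To make the notation above concrete, here is a minimal numerical sketch (not taken from the thesis; array names and sizes are illustrative) of the two quantities just introduced: the pairwise squared Euclidean dissimilarity d_ij = ||x_i - x_j||^2 and the winner-take-all mapping of data items to their nearest code vector w_r.

    import numpy as np

    def pairwise_sq_dissimilarity(X):
        """d_ij = ||x_i - x_j||^2 for a data matrix X of shape (D, d)."""
        sq = np.sum(X ** 2, axis=1)
        return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

    def assign_to_nearest(X, W):
        """Index r of the closest code vector w_r for every data item (the map X -> C)."""
        d = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)   # shape (D, N)
        return np.argmin(d, axis=1)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 2))     # D = 10 toy data items in R^2
    W = rng.normal(size=(3, 2))      # N = 3 code vectors
    print(pairwise_sq_dissimilarity(X).shape)   # (10, 10)
    print(assign_to_nearest(X, W))              # cluster index for each data item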


Also, distance measures other than the squared Euclidean distance can be defined on the feature vectors to bring out their structure [71]. The most general approach, however, does not assume an explicit representation of the data items, e.g. in the form of feature vectors, but takes as input only the pairwise dissimilarities d_ij between the items of the data set. This in turn makes it impossible to determine a representative as in vector quantization, and the result of the clustering is given by the obtained mapping X -> C only.

Extensions of the above minimal clustering paradigm are topographic clustering and hierarchical clustering. Topographic clustering algorithms (see [70] for the most prominent representative) additionally try to preserve or extract information about the proximities between clusters. Hierarchical clustering schemes aim at the construction of a tree that represents clusters at different length scales in the data [87].

1.3 Applications of Clustering

There exist many applications of clustering in fields as diverse as pattern recognition (see [59] for texture segmentation), communications (see [74] for vector quantization), biochemistry (see [54] on proteins), psychology (see [73]), and business ([113]). These applications fall broadly into three categories: data analysis and visualization, data compression, and preprocessing.

When faced with a large data set of feature vectors or pairwise dissimilarity values, it is a reasonable strategy to first look for cluster or group structure in the data. Such a structure can be seen as the skeleton of the data and can serve as a basis for further exploration. Not only does it provide a criterion for partitioning the data and analyzing the groups separately; in the case of Euclidean feature vectors, the obtained cluster centers are feature vectors in the same space as the data and can be interpreted as prototypes of the respective group of data. If, in addition, the clustering algorithm provides information about hierarchical or topographic cluster structure, interesting aspects of the data can be revealed.

In communications and data storage it is often important to compress data, because the bandwidth of communication channels or the capacity of storage devices is constrained. One way to achieve this is vector quantization, which is a special case of clustering that provides code vectors w_r as representatives of data vectors x_i. Once a codebook has been transmitted or stored, data items need no longer be characterized by their corresponding feature vectors x_i, but can be referred to by the index r of the relevant code vector w_r. If the data is well represented by the set of code vectors W, good compression ratios can be achieved. In particular, this method can be extended to optimizing the codebook with respect to the transmission errors induced by channel noise (source channel coding), which leads to the topographic vector quantization discussed in Chapter 4. The representation can even be made robust w.r.t. the loss of code labels in the transmission, i.e. against sudden changes in the bandwidth (see neural gas [56]). Hierarchical schemes also allow for the transmission of, e.g., image data at progressive resolution levels.

Finally, clustering methods can serve as preprocessing for supervised learning techniques [20] and control tasks [101].
The idea is similar to that of data compression: to find a reduced representation of the data for a more efficient application of supervised methods. This approach has, for example, been used for texture segmentation in a robotics application [59]. Clustering techniques are also used to initialize the basis functions in radial basis function networks [7] or to preprocess the data in counterpropagation networks [48].
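The compression and preprocessing ideas of this section can be summarized in a few lines of code. The following sketch is purely illustrative (codebook, sizes, and variable names are assumptions, not material from the thesis): data vectors are encoded as the indices of their nearest code vectors and reconstructed by a table lookup, and the same index representation can be handed to a subsequent supervised method.

    import numpy as np

    def encode(X, W):
        """Replace each data vector by the index of its nearest code vector."""
        d = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)
        return np.argmin(d, axis=1)

    def decode(indices, W):
        """Reconstruct the data from the transmitted or stored indices."""
        return W[indices]

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 8))                          # 1000 vectors of dimension 8
    W = X[rng.choice(len(X), size=16, replace=False)]       # toy codebook of N = 16 entries
    idx = encode(X, W)                                      # 1000 indices, 4 bits each
    X_hat = decode(idx, W)
    print("mean squared distortion:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))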


With regard to these preprocessing applications, the topographic mapping, i.e. the preservation of spatial information between clusters, can provide important information, as shown in [20] for character recognition or in [101] for another robotics application.

1.4 Methods of Clustering

Although many different methods for data clustering have been tried, two main approaches can be identified: parametric and non-parametric clustering. Parametric methods make at least some assumptions about the underlying data structure. In general, these assumptions are incorporated into a global optimality criterion or cost function, which these methods aim to minimize [11]. This class of algorithms can be further divided into generative models and reconstructive models [13], of which the latter will be the focus of this work. Non-parametric methods make fewer assumptions about the data structure and typically follow some local criterion for the construction of the clusters. Since they do not involve learning in the sense of parameter adaptation, they are discussed here for reasons of completeness only. In the following, a brief review of the clustering methods in common use is given.

1.4.1 Non-Parametric Approaches

Typical examples of the non-parametric approach to clustering are the agglomerative and divisive algorithms that produce dendrograms. The agglomerative version starts from clusters consisting of single data items and at each step merges the two clusters with smallest dissimilarity. The procedure then depends on how the dissimilarity between clusters consisting of more than one data point is defined. Essentially three methods are in use: in single-link clustering, at level θ only one pair of data items between two clusters needs to be closer than θ in order to merge them. In contrast, in complete-link clustering all such pairs need to be closer than θ, which leads to more compact clusters. A group-averaging method has also been used [99] (a small code sketch of the agglomerative scheme is given at the end of this subsection).

The divisive version of clustering works in the opposite direction: starting from one cluster containing all data items, this cluster is successively split into smaller clusters according to some dissimilarity criterion, e.g. to maximize the dissimilarity between clusters [99]. Although this method can be applied in a non-parametric setting, it was also applied in tree-structured vector quantization [87], and the soft clustering algorithm [103], which is a special case of the STVQ algorithm discussed in Chapter 4, can also be considered a divisive algorithm.

One of the latest developments in non-parametric clustering employs a pleasant analogy to the physical properties of a model granular magnet [11]. A Potts spin is assigned to each data item, and the interactions between spins are taken as a decreasing function of the distance between items. Depending on the temperature, the system exhibits three distinct phases: ferromagnetic, super-paramagnetic, and paramagnetic. In the super-paramagnetic phase, where clusters of spins with strong interactions become ordered, the spin-spin correlation function is used to partition the spins and the corresponding data items into clusters.

An important advantage of non-parametric clustering methods is that they do not make any assumptions about the underlying data distribution. Also, in general, they do not require a Euclidean vector representation of the data items, but only use pairwise dissimilarities. However, it is known that they perform poorly when clusters overlap or vary in shape, density, or size [11].
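As an illustration of the agglomerative scheme described above, the following sketch merges clusters from a precomputed dissimilarity matrix until a desired number of clusters remains; single-link and complete-link differ only in how the dissimilarity between two multi-item clusters is reduced from the pairwise values. The implementation and the toy data are illustrative assumptions, not code from the thesis.

    import numpy as np

    def agglomerate(D, n_clusters, linkage="single"):
        """Greedy agglomerative clustering on a (D x D) dissimilarity matrix."""
        reduce_ = np.min if linkage == "single" else np.max   # single- vs. complete-link
        clusters = [[i] for i in range(len(D))]               # start: one item per cluster
        while len(clusters) > n_clusters:
            best, pair = np.inf, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    dist = reduce_(D[np.ix_(clusters[a], clusters[b])])
                    if dist < best:
                        best, pair = dist, (a, b)
            a, b = pair                                       # merge the closest pair
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return clusters

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0.0, 0.3, (5, 2)), rng.normal(3.0, 0.3, (5, 2))])
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    print(agglomerate(D, 2, linkage="complete"))              # recovers the two groups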


1.4.2 Generative Models

The basic idea of generative models [7, 120, 9] for clustering is to define a parameterized mixture p(x) = \sum_{r=1}^{N} p(x|r) P(r) of sufficiently simple probability densities p(x|r), which are most often taken to be Gaussians p(x|r) = (2\pi\sigma_r^2)^{-d/2} \exp(-\|x - w_r\|^2 / 2\sigma_r^2), and to adjust its parameters, i.e. the cluster centers w_r, the cluster variances \sigma_r^2, and the mixing coefficients P(r), in such a way as to achieve a good match with the distribution of the input data. This can be achieved by maximizing the data likelihood (ML) or the posterior (MAP) if additional prior information on the parameters is available [123, 124]. Efficient EM schemes [129] exist to perform the optimization. If additional smoothness constraints are imposed on the cluster centers, this approach can also be used to perform a so-called generative topographic mapping (GTM) [9], which preserves information about the spatial relations between clusters.

Generative mixture models offer several advantages due to the fact that they define a proper probability distribution in data space. Not only can they be used for (conditional) density estimation [7], but due to their probabilistic nature they also provide means for dealing with the problem of missing data [34] and with active data selection [21]. Disadvantages are that prior knowledge about the data to be modeled needs to be incorporated into the mixture densities, and that the whole approach is (as yet) limited to data given as Euclidean feature vectors as opposed to, e.g., pairwise dissimilarity data (see Chapter 7).

1.4.3 Reconstructive Methods

Reconstructive clustering methods are generally based on a cost function which in some way incorporates the loss of information incurred by the clustering procedure when trying to reconstruct the original data from the compressed cluster representation [13]. If the data is given in the form of Euclidean data vectors, the task of finding a cluster representation which allows for as accurate a reconstruction of the original data as possible closely relates clustering to vector quantization [99]. In both cases the resulting distortion can be measured as the squared Euclidean distance between the original data vectors and their reconstructions.

The most basic algorithms for optimizing such a cost function are the k-means algorithm [81], which uses an online learning rule for stochastic approximation [102], and the LBG algorithm [74], which is a batch scheme for vector quantization. Both these algorithms have the drawback of being greedy in the sense that they tend to get stuck in local minima of the cost function, but this issue will be addressed later, because the cost function on which both k-means and LBG are based is a special case of the TVQ cost function discussed in this work (see Chapter 4).
As one solution to the problem of local minima, stochastic methods with a close relation to statistical physics, such as simulated annealing [66], were applied to clustering optimization problems [99], but these are often slow due to their stochastic exploration of the multidimensional search space (see Chapter 3).

As an alternative, the idea of deterministic annealing was developed [130, 132, 131, 37], which retains temperature annealing but avoids the stochastic search, replacing it by a deterministic search for local minima of the free energy at a given temperature (see Chapter 3). This leads to efficient clustering algorithms [103, 15] with the additional property that during the annealing a natural hierarchy of clusters emerges in a divisive manner. In order to impose structural constraints on the emerging tree, hierarchical clustering algorithms were developed, which apply the principle of minimum cross entropy and use informative priors to approximate the unstructured clustering solution [87, 57, 58] (see [75, 78] for a different approach, in which the output of one vector quantizer is fed into the next one in the hierarchy).


Another extension of the basic clustering paradigm are topographic clustering algorithms [86, 76, 77, 79, 51, 52, 15, 16], which are also the main topic of this thesis. While in k-means clustering the cost function is invariant under permutation of the cluster indices, in topographic clustering this symmetry is broken and a predefined neighborhood is imposed on the clusters. As a result, neighboring clusters represent close volumes of data space, which helps visualization and, for vector quantization, makes the representation robust w.r.t. noise on the cluster indices (see Chapter 4, Section 4.2 for details).

It is at this point that the connection between clustering algorithms and neural networks is most evident. The Self-Organizing Map (SOM) [70, 69, 71, 72], which is one of the most influential types of neural networks, can be interpreted as a topographic clustering algorithm. The SOM was originally formulated by Kohonen as a heuristic online learning scheme, which served as a model for developmental processes in the brain [94, 93]. The model consists of neurons on a lattice, which defines neighborhood relations between the neurons. For each incoming data vector, the neuron whose weight vector is closest to the data vector is called the winner. Its weight vector and the weight vectors of its neighbors are updated in the direction of the data vector, thus leading to a topographic mapping of the data space (a minimal sketch of this update is given at the end of this subsection). Luttrell [76, 79] showed that the SOM can be seen as a computationally efficient approximation to a stochastic optimization of a cost function for topographic clustering (see Chapter 4 for details). In particular, [79] provides a framework for probabilistic autoencoders in terms of folded Markov chains, which serves well to derive cost functions for efficient coding mechanisms (see Chapter 2 for details).

While the aforementioned variants and extensions of "vanilla" reconstructive clustering all operate on a Euclidean data space directly, other settings are conceivable. One idea is to perform the clustering not in the originally given data space, but in a feature space that is related to the data space in a nonlinear fashion. This would correspond to a preprocessing of the data and could be computationally expensive if the feature space had a much higher dimensionality than the data space. This problem has recently been solved for algorithms that can be expressed solely in terms of scalar products in data space by the application of the so-called potential function method [1], whose generalization is now often called the "kernel trick" (e.g. the support vector machine [127, 22] or nonlinear principal component analysis [111]). As Schölkopf [112] points out, this approach can also be used in clustering (see Chapter 6 for the generalization to topographic clustering).

As mentioned in Section 1.2, a very general representation of a data set for the purpose of clustering is a dissimilarity matrix containing dissimilarity values for all pairs of data items. Clustering cost functions for pairwise dissimilarity data had been discussed earlier in the pattern recognition literature [26], but the work of Buhmann et al. has been most influential [14, 53, 59, 54].
These authors essentially formulate a cost function for dissimilarity data and optimize it using the technique of deterministic annealing (see Chapter 7 for an extension to topographic clustering).
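To make the SOM update rule quoted above concrete, here is a minimal online sketch for a one-dimensional chain of neurons with a Gaussian neighborhood; the learning-rate and neighborhood schedules are ad hoc illustrative choices, and the code is not the exact algorithm analyzed later in this thesis.

    import numpy as np

    def som_step(W, x, eta, sigma):
        """One Kohonen update: move the winner and its lattice neighbors toward x."""
        winner = np.argmin(np.sum((W - x) ** 2, axis=1))           # closest weight vector
        lattice = np.arange(len(W))
        h = np.exp(-(lattice - winner) ** 2 / (2.0 * sigma ** 2))  # neighborhood h_rs
        return W + eta * h[:, None] * (x - W)

    rng = np.random.default_rng(3)
    W = rng.normal(size=(10, 2))                     # 10 neurons with 2-D weight vectors
    data = rng.uniform(-1.0, 1.0, size=(2000, 2))
    for t, x in enumerate(data):
        frac = 1.0 - t / len(data)
        W = som_step(W, x, eta=0.5 * frac, sigma=3.0 * frac + 0.5)
    print(W)      # weight vectors end up roughly ordered along the chain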


1.5 Overview of the Thesis

1.5.1 Primary Objective

This thesis presents a principled approach to clustering by formulating the problem in a probabilistic autoencoder framework based on folded Markov chains. As a suitable optimization technique, deterministic annealing is introduced, which performs robust optimization based on an analogy to the cooling of a system in statistical physics. Application of deterministic annealing to the derived clustering cost functions leads to three algorithms: soft topographic vector quantization (STVQ), which performs topographic clustering on Euclidean feature vectors; kernel-based soft topographic mapping (STMK), which allows the clustering to be carried out in a high dimensional Euclidean feature space by application of the kernel trick; and soft topographic mapping for proximity data (STMP), which generalizes STVQ to arbitrary pairwise dissimilarity data in a mean-field fashion. All three algorithms are analysed w.r.t. the annealing process, and their application is demonstrated on both artificial and real-world data.

1.5.2 Structure of the Thesis

The remainder of the thesis is structured as follows. In Chapter 2, two clustering cost functions, one for pairwise dissimilarity data and one for Euclidean data, are derived in a probabilistic autoencoder framework based on folded Markov chains (chapter based on [45]). In Chapter 3, available optimization techniques are introduced, with a focus on simulated annealing, which serves as a conceptual basis for deterministic annealing and the EM algorithm. In Chapter 4, deterministic annealing is applied to the TVQ cost function and the soft topographic vector quantization (STVQ) algorithm is derived. STVQ is shown to serve as a unifying framework for known clustering algorithms and is applied to the problem of compressing image data to be sent via a noisy transmission channel (chapter based on [41, 42, 17, 18, 43]). In Chapter 5, the annealing process is analyzed in detail, and the critical temperatures of phase transitions are calculated in terms of the covariance matrix of the data and the coupling matrix of the clusters. The analytical results are verified using computer simulations (chapter based on [42]). In Chapter 6, the STVQ algorithm is extended to high dimensional feature spaces, which are related to data space in a non-linear fashion, by application of the kernel trick. The working of this kernel-based soft topographic mapping (STMK) is demonstrated on handwritten digit data (chapter based on [44, 43]). In Chapter 7, the STVQ algorithm is generalized to pairwise dissimilarity data by approximating the corresponding cost function in a mean-field fashion. The resulting algorithm, soft topographic mapping for proximity data (STMP), is applied to artificial Euclidean data and to a dissimilarity matrix of areas of the cat's cerebral cortex (chapter based on [45, 43]). Chapter 8, finally, summarizes the work presented and discusses ideas for future work.


Chapter 2
The Probabilistic Autoencoder Framework

2.1 Probabilistic Autoencoders

2.1.1 Unsupervised Learning and Autoencoders

One way of looking at unsupervised learning is to consider autoencoders or autoassociative neural networks [50, 82]. The basic idea is to force a learning machine to replicate its input, i.e. to train it to learn the identity map. This would be trivial, of course, if no constraints had to be obeyed. In the paradigm of feedforward neural networks, the constraint is typically a bottleneck in the feedforward connections, which limits the bandwidth of transmission and therefore forces the network to learn an efficient representation of the input data, so as to represent it as faithfully as possible in the bottleneck layer. The part of the network before the bottleneck can then be interpreted as an encoder, the bottleneck itself is the bandwidth-limited transmission channel, and the part after the bottleneck is the decoder. This framework can be generalized to the non-deterministic case, where encoder, transmission channel, and decoder are given by conditional probabilities P(output|input). For such a communication chain it is natural to assume the Markov property, i.e. to assume that the output of each stage in the process depends only on its input and not on the history of the process. This leads to the framework of folded Markov chains, which was introduced by Luttrell [79] in an attempt to unify different unsupervised learning schemes as special cases of probabilistic multilayer autoencoders. In this chapter, the folded Markov chain framework is used to derive cost functions for topographic clustering for both pairwise dissimilarity data and Euclidean feature vectors. These cost functions will form the basis of the subsequently developed clustering algorithms.

2.1.2 The General Folded Markov Chain

Following [79], a general folded Markov chain (FMC) with L stages consists of L probabilistic transformations of an input item x_0 to an output item x_L via L-1 intermediate items x_1, x_2, ..., x_{L-1}, and of the corresponding Bayes-inverse transformations, which result in a reconstructed version x_0' of the input item. Denote the probabilistic encoder from stage k to stage k+1 by the conditional probability P_{k+1,k}(x_{k+1}|x_k) and the corresponding decoder by \tilde{P}_{k,k+1}(x_k'|x_{k+1}'). Then Bayes' theorem relates the conditional probabilities of the encoder/decoder pairs as follows:

    \tilde{P}_{k,k+1}(x_k|x_{k+1}) \, P_{k+1}(x_{k+1}) = P_{k+1,k}(x_{k+1}|x_k) \, P_k(x_k) ,          (2.1)

where P_k(x_k) is the marginal probability of x_k at stage k.


Any joint probability in the FMC framework is specified by the source probabilities P_0(x_0) and the L probabilistic encoders P_{k+1,k}(x_{k+1}|x_k), k = 0, 1, ..., L-1. The joint probability can thus be expressed as

    P(x_0, x_1, \ldots, x_L, x_L', \ldots, x_1', x_0') = P_0(x_0) P_{1,0}(x_1|x_0) \cdots P_{L,L-1}(x_L|x_{L-1}) \, \delta(x_L, x_L') \, \tilde{P}_{L-1,L}(x_{L-1}'|x_L') \cdots \tilde{P}_{0,1}(x_0'|x_1') ,          (2.2)

where \delta(x_L, x_L') is the Kronecker delta and ensures that the output of the encoder is equal to the input of the decoder. The marginal probability P_k(x_k) can be obtained by taking the sum over all preceding stages in the FMC, which gives

    P_k(x_k) = \sum_{x_0, x_1, \ldots, x_{k-1}} P_0(x_0) P_{1,0}(x_1|x_0) \cdots P_{k,k-1}(x_k|x_{k-1}) ,          (2.3)

where the sums here and in the following are assumed to run over all states of the respective summation variable. Note that the folded Markov chain framework could just as well be presented in a continuum formulation, with sums replaced by integrals and probabilities replaced by probability densities. However, since this work is concerned with variants of vector quantization based on finite data sets and discrete code indices, the discrete notation seems more appropriate. Now let us introduce a measure d(x_0, x_0') which assigns a cost to the distortion introduced by the encoding/decoding process for the pair {x_0, x_0'}. A summation over all possible encodings and decodings in an L-stage folded Markov chain then provides a cost function E_L, which reflects the total distortion:

    E_L = \sum_{x_0, x_1, \ldots, x_L} \; \sum_{x_L', x_{L-1}', \ldots, x_0'} P_0(x_0) P_{1,0}(x_1|x_0) \cdots P_{L,L-1}(x_L|x_{L-1}) \, \delta(x_L, x_L') \, \tilde{P}_{L-1,L}(x_{L-1}'|x_L') \cdots \tilde{P}_{0,1}(x_0'|x_1') \, d(x_0, x_0') .          (2.4)

The cost function (2.4) is a function of the encoders and decoders and thus yields an optimality criterion for the construction of probabilistic coding schemes. This work will only be concerned with the special case L = 2, which is depicted in Figure 2.1.

2.2 Derivation of the Cost Functions

2.2.1 The Two-Stage Folded Markov Chain

Let us now consider the cost function E_2 for a two-stage FMC, which is a special case of (2.4):

    E_2 = \sum_{x_0, x_1, x_2} \; \sum_{x_2', x_1', x_0'} P_0(x_0) P_{1,0}(x_1|x_0) P_{2,1}(x_2|x_1) \, \delta(x_2, x_2') \, \tilde{P}_{1,2}(x_1'|x_2') \, \tilde{P}_{0,1}(x_0'|x_1') \, d(x_0, x_0') .          (2.5)

The decoders \tilde{P}_{0,1}(x_0'|x_1') and \tilde{P}_{1,2}(x_1'|x_2') are related to their corresponding encoders P_{1,0}(x_1'|x_0') and P_{2,1}(x_2'|x_1') via Bayes' theorem (2.1) in the form

    \tilde{P}_{0,1}(x_0'|x_1') = \frac{P_{1,0}(x_1'|x_0') P_0(x_0')}{P_1(x_1')} , \qquad \tilde{P}_{1,2}(x_1'|x_2') = \frac{P_{2,1}(x_2'|x_1') P_1(x_1')}{P_2(x_2')} .          (2.6)


[Figure 2.1: Illustration of a probabilistic autoencoder in the form of a two-stage folded Markov chain. A data item x_0 is encoded by the probabilistic encoders P_{1,0}(x_1|x_0) and P_{2,1}(x_2|x_1) and recovered by the corresponding decoders \tilde{P}_{1,2}(x_1'|x_2') and \tilde{P}_{0,1}(x_0'|x_1'), leading to the reconstructed item x_0'. The resulting distortion is measured as d(x_0, x_0').]

Using (2.6), cancelling P_1(x_1'), and summing over x_2' by making use of the Kronecker delta \delta(x_2, x_2'), equation (2.5) can be simplified to an expression which no longer depends on the decoders \tilde{P}_{0,1}(x_0'|x_1') and \tilde{P}_{1,2}(x_1'|x_2'):

    E_2 = \sum_{x_0, x_1, x_2} \; \sum_{x_1', x_0'} P_0(x_0) P_{1,0}(x_1|x_0) P_{2,1}(x_2|x_1) \frac{P_{2,1}(x_2|x_1')}{P_2(x_2)} P_{1,0}(x_1'|x_0') P_0(x_0') \, d(x_0, x_0') .          (2.7)

However, the marginal probability P_2(x_2) had to be introduced, which, using (2.3), can be expressed as

    P_2(x_2) = \sum_{x_0, x_1} P_{2,1}(x_2|x_1) P_{1,0}(x_1|x_0) P_0(x_0) .          (2.8)

Introducing the notation P_{2,0}(x_2|x_0) = \sum_{x_1} P_{2,1}(x_2|x_1) P_{1,0}(x_1|x_0), the cost function can thus be expressed solely in terms of encoders:

    E_2 = \sum_{x_0, x_2, x_0'} \frac{P_0(x_0) P_{2,0}(x_2|x_0) P_{2,0}(x_2|x_0') P_0(x_0')}{P_2(x_2)} \, d(x_0, x_0') .          (2.9)

This formulation of the cost function, which has been derived without any assumptions about the distortion measure d(x_0, x_0'), will effectively form the basis for the soft topographic mapping for proximity data (STMP) to be derived in Chapter 7.
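Since the algebra above is easy to get wrong, the following brute-force sketch checks numerically, on small illustrative toy distributions, that the decoder-free form (2.9) agrees with the original cost (2.5) once the decoders are chosen as the Bayes inverses (2.6).

    import numpy as np

    rng = np.random.default_rng(4)
    n0, n1, n2 = 5, 4, 3                                   # sizes of the three state spaces
    P0 = rng.dirichlet(np.ones(n0))                        # source distribution P_0(x_0)
    P10 = rng.dirichlet(np.ones(n1), size=n0).T            # P10[x1, x0] = P_{1,0}(x_1 | x_0)
    P21 = rng.dirichlet(np.ones(n2), size=n1).T            # P21[x2, x1] = P_{2,1}(x_2 | x_1)
    dist = rng.uniform(size=(n0, n0))
    dist = 0.5 * (dist + dist.T)                           # symmetric distortion d(x_0, x_0')

    P20 = P21 @ P10                                        # P_{2,0}(x_2 | x_0)
    P2 = P20 @ P0                                          # marginal P_2(x_2), eq. (2.8)
    P1 = P10 @ P0                                          # marginal P_1(x_1)

    # Encoder-only form (2.9)
    E_29 = sum(P0[i] * P20[c, i] * P20[c, j] * P0[j] / P2[c] * dist[i, j]
               for i in range(n0) for j in range(n0) for c in range(n2))

    # Original form (2.5) with the Bayes-inverse decoders (2.6)
    D01 = (P10 * P0[None, :]).T / P1[None, :]              # ~P_{0,1}(x_0' | x_1')
    D12 = (P21 * P1[None, :]).T / P2[None, :]              # ~P_{1,2}(x_1' | x_2')
    E_25 = sum(P0[i] * P10[a, i] * P21[c, a] * D12[b, c] * D01[j, b] * dist[i, j]
               for i in range(n0) for a in range(n1) for c in range(n2)
               for b in range(n1) for j in range(n0))
    print(np.isclose(E_25, E_29))                          # -> True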


2.2.2 Cost Function for TMP

In order to make contact with known clustering results and equations, let us introduce a more specific notation at this point. The data items x_0, x_0' are labelled by indices i, j, k, ..., and the representations x_1, x_1', x_2, x_2' are denoted by code labels r, s, t, .... Let us now choose the hitherto probabilistic encoder P_{1,0}(x_1|x_0) = P_{1,0}(r|i) to be deterministic. In this case one can express the encoder in terms of a stochastic matrix M = (m_{ir})_{i=1,...,D; r=1,...,N} in {0,1}^{D x N}, whose elements are binary assignment variables m_{ir} in {0,1} with \sum_r m_{ir} = 1 for all i. In accordance with the SOM literature [71, 72], let us fix the second encoder and express it in terms of a matrix H = (h_{rs})_{r=1,...,N; s=1,...,N} in R^{N x N} with elements h_{rs} := P_{2,1}(s|r), subject to the constraints \sum_s h_{rs} = 1 for all r. In the literature on Kohonen's SOM, h_{rs} is called a neighborhood function in the space of neurons and determines the coupling between neurons r and s due to their spatial arrangement in a "neural lattice". In another interpretation, which is more consistent with the autoencoder model, H is the transition matrix of the noise on the internal representation of the data in the autoencoder. In the clustering framework, the noise induces transitions between cluster indices, which could be due to channel noise during transmission (see Chapter 4 for an example on source channel coding). The data are given by a dissimilarity matrix D = (d_{ij})_{i,j=1,...,D} in R^{D x D}. With these notational conventions, (2.9) can be written as the following cost function for the topographic mapping of D data items onto N clusters via their dissimilarity values d_{ij}:

    E_{TMP}(M) = \frac{1}{2} \sum_{i,j=1}^{D} \sum_{r,s,t=1}^{N} \frac{m_{ir} h_{rs} m_{jt} h_{ts}}{\sum_{k=1}^{D} \sum_{u=1}^{N} m_{ku} h_{us}} \, d_{ij} .          (2.10)

The factor 1/2 has been introduced for computational convenience. Note that the source distributions over data items P_0(x_0) and P_0(x_0') are implicit in the sums over the i.i.d. data items. The relation to topographic clustering is seen as follows. A cost d_{ij} is incurred whenever two data items x_i and x_j are associated with the same cluster. How strongly a data item is associated with a cluster is determined as a linear combination of the corresponding assignment variables m_{ir} weighted by the transition probabilities h_{rs}. Hence, configurations of low cost are those in which similar pairs of data items are assigned to the same cluster, or at least to a pair of clusters with high transition probability, while dissimilar pairs should be assigned to different clusters with a low transition probability. The denominator enforces coherent clusters by normalizing the cost by the cluster size.

Let us consider three special cases of the above cost function E_{TMP}. (i) If the second encoder or neighborhood matrix is taken to be h_{rs} = \delta_{rs}, a cost function is recovered that is equivalent to the cost function for pairwise clustering introduced by Hofmann and Buhmann [58]. The normalizing denominator in (2.10) then becomes \sum_{k=1}^{D} m_{ks}, which had been introduced in [58] based on heuristic arguments about cluster coherency; in the present derivation it appears as a natural consequence of Bayes' theorem applied to a probabilistic autoencoder. (ii) In the special case of a one-to-one mapping, i.e. N = D and \sum_i m_{ir} = 1 for all r, one obtains a form equivalent to the C measure, which was introduced by Goodhill [39] as a unifying objective function for topographic one-to-one mappings.
(iii) For Euclidean data vectors, a simpler version of the cost function can be derived, which serves as the basis for topographic vector quantization. This will become clear in the next section.
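For readers who prefer code to index gymnastics, the cost (2.10) can be evaluated directly. The sketch below is an illustrative vectorized implementation (the toy data, assignments, and transition matrix H are made-up examples); it exploits the fact that both numerator and denominator depend only on the matrix product A = MH.

    import numpy as np

    def tmp_cost(Dmat, M, H):
        """E_TMP of eq. (2.10): Dmat (D x D) dissimilarities, M (D x N) binary
        assignments with unit row sums, H (N x N) row-stochastic transition matrix."""
        A = M @ H                        # A[i, s] = sum_r m_ir h_rs
        col = A.sum(axis=0)              # sum_k sum_u m_ku h_us, one value per s
        Q = (A / col) @ A.T              # Q[i, j] = sum_s A[i, s] A[j, s] / col[s]
        return 0.5 * np.sum(Q * Dmat)

    rng = np.random.default_rng(5)
    D_items, N = 20, 4
    X = rng.normal(size=(D_items, 2))
    Dmat = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)       # pairwise dissimilarities
    M = np.eye(N)[np.arange(D_items) % N]                             # hard assignments, no empty cluster
    H = np.full((N, N), 0.1 / (N - 1)) + (0.9 - 0.1 / (N - 1)) * np.eye(N)   # rows sum to one
    print(tmp_cost(Dmat, M, H))
    print(tmp_cost(Dmat, M, np.eye(N)))   # h_rs = delta_rs: the pairwise clustering cost of [58]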


2.2.3 The Squared Euclidean Distance Distortion

Starting again from (2.5), one can further simplify the cost function in a different way by assuming that the data items x_0 and x_0' have corresponding feature vectors in a Euclidean vector space, \mathbf{x}_0, \mathbf{x}_0' \in R^d, and that the distortion is measured as the squared Euclidean distance between these vectors, d(\mathbf{x}_0, \mathbf{x}_0') = \|\mathbf{x}_0 - \mathbf{x}_0'\|^2. Using \mathbf{x}_0 instead of x_0, (2.5) can be written as

    E_2 = \sum_{\mathbf{x}_0, x_2} \; \sum_{x_2', \mathbf{x}_0'} P_0(\mathbf{x}_0) P_{2,0}(x_2|\mathbf{x}_0) \, \delta(x_2, x_2') \, \tilde{P}_{0,2}(\mathbf{x}_0'|x_2') \, \|\mathbf{x}_0 - \mathbf{x}_0'\|^2 ,          (2.11)

where the shorthand notations P_{2,0}(x_2|\mathbf{x}_0) = \sum_{x_1} P_{2,1}(x_2|x_1) P_{1,0}(x_1|\mathbf{x}_0) and \tilde{P}_{0,2}(\mathbf{x}_0'|x_2') = \sum_{x_1'} \tilde{P}_{0,1}(\mathbf{x}_0'|x_1') \tilde{P}_{1,2}(x_1'|x_2') have been used. Again x_2' is summed out using the \delta(x_2, x_2'), and applying Bayes' theorem (2.1) in the form P_{2,0}(x_2|\mathbf{x}_0) P_0(\mathbf{x}_0) = \tilde{P}_{0,2}(\mathbf{x}_0|x_2) P_2(x_2) leads to the following expression for the cost function:

    E_2 = \sum_{x_2} P_2(x_2) \sum_{\mathbf{x}_0, \mathbf{x}_0'} \tilde{P}_{0,2}(\mathbf{x}_0|x_2) \, \tilde{P}_{0,2}(\mathbf{x}_0'|x_2) \, \|\mathbf{x}_0 - \mathbf{x}_0'\|^2 .          (2.12)

Now expanding the norm, \|\mathbf{x}_0 - \mathbf{x}_0'\|^2 = \|\mathbf{x}_0\|^2 + \|\mathbf{x}_0'\|^2 - 2\,\mathbf{x}_0 \cdot \mathbf{x}_0', and performing summations where possible, one obtains

    E_2 = 2 \sum_{x_2} P_2(x_2) \left[ \sum_{\mathbf{x}_0} \tilde{P}_{0,2}(\mathbf{x}_0|x_2) \|\mathbf{x}_0\|^2 - \Big\| \sum_{\mathbf{x}_0} \tilde{P}_{0,2}(\mathbf{x}_0|x_2) \, \mathbf{x}_0 \Big\|^2 \right] .          (2.13)

This expression can be rewritten in terms of the squared Euclidean distance between the data vector \mathbf{x}_0 and a weighted mean over data space,

    E_2 = 2 \sum_{x_2} P_2(x_2) \sum_{\mathbf{x}_0} \tilde{P}_{0,2}(\mathbf{x}_0|x_2) \Big\| \mathbf{x}_0 - \sum_{\mathbf{u}_0} \tilde{P}_{0,2}(\mathbf{u}_0|x_2) \, \mathbf{u}_0 \Big\|^2 .          (2.14)

Comparing (2.14) with (2.12), this derivation can be given a straightforward interpretation: the average squared Euclidean distance of two vectors \mathbf{x}_0, \mathbf{x}_0' drawn independently from \tilde{P}_{0,2}(\mathbf{x}_0|x_2) is twice the variance of vectors drawn from \tilde{P}_{0,2}(\mathbf{x}_0|x_2) (see [79]). Now apply Bayes' theorem (2.1) in the form \tilde{P}_{0,2}(\mathbf{x}_0|x_2) P_2(x_2) = P_{2,0}(x_2|\mathbf{x}_0) P_0(\mathbf{x}_0) to replace the decoder by the corresponding encoder:

    E_2 = 2 \sum_{\mathbf{x}_0} P_0(\mathbf{x}_0) \sum_{x_2} P_{2,0}(x_2|\mathbf{x}_0) \, \|\mathbf{x}_0 - \mathbf{w}(x_2)\|^2 ,          (2.15)

with \mathbf{w}(x_2) = \sum_{\mathbf{u}_0} \tilde{P}_{0,2}(\mathbf{u}_0|x_2) \, \mathbf{u}_0, which is called the cluster center and depends on x_2. This formulation of the cost function, which is based on a Euclidean vector representation of the data items, will effectively form the basis for the algorithms soft topographic vector quantization (STVQ) and kernel-based soft topographic mapping (STMK) of Chapter 4 and Chapter 6, respectively.
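The step from (2.12) to (2.13) and (2.14) rests on a standard identity which is easy to verify directly. Writing \langle \cdot \rangle for the expectation under \tilde{P}_{0,2}(\cdot|x_2) and using the independence of the two draws (this short check is added here for convenience and is not part of the original derivation),

    \sum_{\mathbf{x}_0, \mathbf{x}_0'} \tilde{P}_{0,2}(\mathbf{x}_0|x_2) \, \tilde{P}_{0,2}(\mathbf{x}_0'|x_2) \, \|\mathbf{x}_0 - \mathbf{x}_0'\|^2 \;=\; 2\left( \langle \|\mathbf{x}_0\|^2 \rangle - \|\langle \mathbf{x}_0 \rangle\|^2 \right) \;=\; 2 \, \big\langle \|\mathbf{x}_0 - \langle \mathbf{x}_0 \rangle\|^2 \big\rangle ,

since the cross term \langle \mathbf{x}_0 \cdot \mathbf{x}_0' \rangle factorizes into \langle \mathbf{x}_0 \rangle \cdot \langle \mathbf{x}_0' \rangle. This is exactly the "twice the variance" statement quoted from [79].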


2.2.4 Cost Function for TVQ

Adopting notation similar to that introduced for the TMP cost function above, one can write down the cost function E_{TVQ} for what Luttrell [77] called topographic vector quantization,

    E_{TVQ}(M, \mathbf{w}) = \frac{1}{2} \sum_{i=1}^{D} \sum_{r=1}^{N} m_{ir} \sum_{s=1}^{N} h_{rs} \, \|\mathbf{x}_i - \mathbf{w}_s\|^2 ,          (2.16)

where again the first encoder is expressed in terms of a binary stochastic matrix M and the second encoder H is fixed. Additionally, a parameter vector \mathbf{w} of dimension dN has been introduced, which is a concatenation (\mathbf{w}_1^T, \mathbf{w}_2^T, \ldots, \mathbf{w}_N^T)^T of the cluster centers \mathbf{w}_r \in R^d in data space.

In order to understand the cost function (2.16), let us consider the special case h_{rs} = \delta_{rs}, for which E_{TVQ}(M, \mathbf{w}) reduces to a cost function for central clustering [81, 74, 103]. Its relation to clustering is seen as follows: only those terms contribute to the cost for which m_{ir} = 1, and the cost incurred equals half the squared Euclidean distance \|\mathbf{x}_i - \mathbf{w}_r\|^2 between data vector \mathbf{x}_i and cluster center \mathbf{w}_r. This corresponds to a winner-take-all (WTA) rule, because each data item is assigned to one cluster only. The configuration of minimal cost is achieved when the distances between the data vectors and the cluster centers to which they are assigned are minimal. Since the cluster centers live in the same space as the data vectors, they serve as representatives in data space for those data vectors which are assigned to them. Considering also those vectors \mathbf{x} \in R^d which are not in the data set X, we can think of R^d as being divided into N tessellation cells, one for each cluster center. This is effectively a quantization of the space, because each vector \mathbf{x} \in R^d is assigned to and represented by one cluster center \mathbf{w}_r, hence the name vector quantization.

Considering the full cost function (2.16), the matrix H breaks the permutation symmetry of the cluster indices and induces a coupling between cluster centers. The elements of H can be looked upon as transition probabilities induced by channel noise, which was the original model proposed by [76] and which has motivated the example of noisy vector quantization at the end of Chapter 4. The transition probabilities are also closely related to the elements of the neighborhood matrix in the Self-Organizing Map algorithm [70, 69]. In this context they serve mainly for visualization purposes, because due to the coupling, data points that are close to each other in data space will be assigned to clusters that are close in the sense of a high transition probability. The exact relation to the SOM will be pointed out in Chapter 4.

Now that the clustering problem has been formulated as an optimization task, namely to minimize the cost function w.r.t. cluster centers and assignments, the next chapter will deal with the question of how to solve this task efficiently.
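As a concrete reference point for the algorithms derived later, the cost (2.16) can be computed in a couple of lines. The sketch below uses illustrative toy data and a simple choice of H (both are assumptions for demonstration), and the special case H = I reproduces the k-means/central-clustering cost mentioned above.

    import numpy as np

    def tvq_cost(X, W, M, H):
        """E_TVQ of eq. (2.16): X (D x d) data, W (N x d) cluster centers,
        M (D x N) binary assignments, H (N x N) row-stochastic neighborhood."""
        sq = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2)   # ||x_i - w_s||^2
        return 0.5 * np.sum(M * (sq @ H.T))   # sum_{i,r} m_ir sum_s h_rs ||x_i - w_s||^2

    rng = np.random.default_rng(6)
    N = 5
    X = rng.normal(size=(50, 2))
    W = rng.normal(size=(N, 2))
    winners = np.argmin(np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=2), axis=1)
    M = np.eye(N)[winners]                           # winner-take-all assignments
    H = np.full((N, N), 0.05) + 0.75 * np.eye(N)     # rows sum to one for N = 5
    print(tvq_cost(X, W, M, H))
    print(tvq_cost(X, W, M, np.eye(N)))              # h_rs = delta_rs: the k-means cost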


Chapter 3
Optimization

3.1 Optimization Methods

3.1.1 General Remarks

Since clustering has been defined in terms of an optimization problem in the previous chapter, this section briefly reviews some known optimization techniques. The next section is then devoted to deterministic annealing, the specific procedure used in this work.

To begin with, it should be noted that the idea of optimization, although closely related to many engineering problems, can be traced back to the natural sciences and especially to physics. The idea in physics is to formulate natural laws in terms of variational or optimization principles instead of, e.g., equations of motion. Probably the most famous example is the principle of least action (see Feynman [31] for an entertaining textbook account), which can serve as the basis for theoretical mechanics: a particle takes the path of least action, where the quantity "action" is a functional of the path, namely the difference between kinetic and potential energy integrated over the path. This description incorporates the path as a whole, as opposed to the local nature of differential equations. Another example of this kind is the principle of least time in optics, first formulated by Fermat [31].

Apart from these principled considerations, optimization plays an extremely important role in disciplines like engineering, economics, and logistics, and several methods have been developed to solve optimization problems in these fields. For many problems with continuous variables that can be expressed in terms of linear or quadratic cost functions with appropriate constraints, there exist standard algorithms, commonly referred to as linear programming (LP) and quadratic programming (QP) [46], respectively.

In the more "interesting" cases, the variables to be optimized may be discrete (taking only a finite number of values), or the cost function and constraints may be non-linear in the parameters. In these cases, more elaborate schemes for searching the parameter space have to be used. In the following, two of these techniques are introduced: stochastic gradient descent [90], which is particularly popular with neural network implementations, and simulated annealing [66], which is one of the most versatile optimization tools and serves as a conceptual basis for deterministic annealing.

3.1.2 Stochastic Gradient Descent

Given a cost or energy function E(\mathbf{w}) \in R, which depends on a parameter vector \mathbf{w} \in R^d, the aim of optimization is to find the parameter values \mathbf{w} that minimize E(\mathbf{w}). One way of doing this is to use gradient descent on the cost surface defined by E(\mathbf{w}).


Starting from an initial guess \mathbf{w} = \mathbf{w}^{(0)}, the parameter vector is iteratively updated according to

    \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \Delta\mathbf{w}^{(t)} .          (3.1)

In simple gradient descent or steepest descent algorithms, \Delta\mathbf{w}^{(t)} takes the form

    \Delta\mathbf{w}^{(t)} = -\eta(t) \, \nabla E(\mathbf{w} = \mathbf{w}^{(t)}) ,          (3.2)

where \eta(t) is the learning rate parameter and should be chosen as a decreasing function of time in order to speed up convergence in the beginning and to enforce accuracy near the optimum. However, many other batch algorithms exist that perform nonlinear optimization based on gradient information, among them conjugate gradient, Newton and quasi-Newton methods, or, in particular for sum-of-squares errors, the Levenberg-Marquardt algorithm (see [7] for a textbook account).

An alternative to the above batch optimization scheme is a sequential or online update. Suppose the cost function can be written as a sum of contributions E_i of single data items i. Then, taking the data items in a random sequence, one can update the parameters according to

    \Delta\mathbf{w}^{(t)} = -\eta(t) \, \nabla E_i(\mathbf{w} = \mathbf{w}^{(t)}) .          (3.3)

For sufficiently small values of \eta, the average direction of update then approximates the direction of steepest descent. The time dependence of \eta has been analyzed by Robbins and Monro [102], who prove convergence in probability if the following conditions hold for the learning rate:

    \lim_{t \to \infty} \eta(t) = 0 ,          (3.4)

    \sum_{t=0}^{\infty} \eta(t) = \infty ,          (3.5)

    \sum_{t=0}^{\infty} \eta(t)^2 < \infty ,          (3.6)

where the first condition (3.4) ensures that successive updates sufficiently decrease in magnitude, the second condition (3.5) guarantees that the minimum can be reached, and the third condition keeps the sampling noise low. In practice, a small constant value of \eta has been found effective, although more elaborate schemes have been suggested [90].

The online procedure has several advantages: (i) the algorithm can avoid getting stuck in local minima, because the effective cost function changes in each step; (ii) if the data set contains redundancy, sampling the data can lead to a reduction of computational costs as compared to the batch procedure; (iii) since data items can be discarded after the update, less storage capacity is necessary; (iv) if the underlying process changes slowly with time, this behavior can be tracked in the online mode. This work, however, will focus on batch methods, which are more accessible in analytic terms and can be approximated stochastically in an online mode.
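A compact sketch of the online update (3.3) with a learning-rate schedule satisfying the Robbins-Monro conditions (3.4)-(3.6) is given below; the quadratic example problem and all constants are illustrative choices, not material from the thesis.

    import numpy as np

    rng = np.random.default_rng(7)
    w_true = np.array([2.0, -1.0])
    X = rng.normal(size=(500, 2))
    y = X @ w_true + 0.1 * rng.normal(size=500)       # cost E(w) = sum_i (x_i . w - y_i)^2

    w = np.zeros(2)
    for t in range(5000):
        i = rng.integers(len(X))                      # draw a random data item
        grad_i = 2.0 * (X[i] @ w - y[i]) * X[i]       # gradient of E_i at the current w
        eta = 1.0 / (10.0 + t)                        # sum eta diverges, sum eta^2 converges
        w = w - eta * grad_i                          # online update, eq. (3.3)
    print(w)                                          # roughly recovers w_true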


3.1.3 Simulated Annealing

One of the most universally applicable optimization schemes is simulated annealing [66, 32, 61]. It is based on ideas from statistical physics and can be seen as an application of the Metropolis algorithm [84] to optimization. Let \mathbf{w} be the parameter vector, E(\mathbf{w}) the cost function, and assume a finite number of states \{\mathbf{w}\} for the sake of simplicity. The idea of simulated annealing then is to perform a stochastic search in the parameter space \{\mathbf{w}\}. Starting from a random initialization \mathbf{w}_0 = \mathbf{w}_{old}, a candidate parameter vector \mathbf{w}_{new} is generated randomly in the vicinity of \mathbf{w}_{old} using a generating distribution g(\Delta\mathbf{w}), which can be specified in terms of \Delta\mathbf{w} = \mathbf{w}_{new} - \mathbf{w}_{old} by

    g(\Delta\mathbf{w}) = \left( \frac{2\pi}{\beta} \right)^{-d/2} \exp\left( -\frac{\beta}{2} \, \Delta\mathbf{w}^2 \right) ,          (3.7)

where d is the dimensionality of the parameter space. Then the following rule, depending on the difference in cost \Delta E = E(\mathbf{w}_{new}) - E(\mathbf{w}_{old}) of the two parameter vectors, is applied for accepting or rejecting \mathbf{w}_{new}:

    if \Delta E < 0 : accept ,          (3.8)
    if \Delta E \geq 0 : accept with probability \exp(-\beta \Delta E) ,          (3.9)

where \beta is called the inverse temperature parameter. This rule results in a Markov-chain stochastic walk through \{\mathbf{w}\}, which is not constrained to decrease the cost function E(\mathbf{w}) at each step and can thus escape local minima of the cost function at finite values of \beta. The probability distribution in equilibrium can be calculated using the principle of detailed balance,

    P(\mathbf{w}_{new}|\mathbf{w}_{old}) P(\mathbf{w}_{old}) = P(\mathbf{w}_{old}|\mathbf{w}_{new}) P(\mathbf{w}_{new}) ,          (3.10)

which yields the Gibbs distribution

    P(\mathbf{w}) = \frac{1}{Z} \exp(-\beta E(\mathbf{w})) ,          (3.11)

with the partition function Z given by

    Z = \sum_{\{\mathbf{w}\}} \exp(-\beta E(\mathbf{w})) .          (3.12)

In simulated annealing, \beta is used as an annealing parameter for the optimization and is varied during the search, \beta = \beta(t). Starting with low values of \beta, where all of the search space is easily accessible, the value of \beta is increased stepwise until the accessible volume of the search space is narrowed down to the global minimum of the cost function. In order to find the global minimum of the cost function, a careful annealing schedule for \beta(t) has to be chosen. Geman and Geman [32] showed that for the generating function (3.7) the schedule

    \beta(t) = \beta(0) \ln t          (3.13)

leads to convergence in probability to the global minimum of the cost function.

Simulated annealing as described above has several advantages: (i) it can be used for the optimization of a large class of cost functions, irrespective of nonlinearities, discontinuities, and stochasticity; (ii) boundary conditions and constraints can easily be satisfied; (iii) there is a statistical guarantee for finding an optimal solution.
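The acceptance rule (3.8)-(3.9) is short enough to state as code. The sketch below anneals a one-dimensional toy cost with a geometric beta schedule; this is the "simulated quenching" shortcut discussed below rather than the logarithmic schedule (3.13), so it carries no convergence guarantee, and the cost function and all constants are illustrative choices.

    import numpy as np

    def cost(w):
        return (w ** 2 - 4.0) ** 2 + 3.0 * np.sin(5.0 * w)        # several local minima

    rng = np.random.default_rng(8)
    w = rng.uniform(-4.0, 4.0)
    beta = 0.1
    for t in range(20000):
        w_new = w + rng.normal(scale=0.5)                          # Gaussian proposal, cf. (3.7)
        dE = cost(w_new) - cost(w)
        if dE < 0 or rng.uniform() < np.exp(-beta * dE):           # Metropolis rule (3.8)-(3.9)
            w = w_new
        beta *= 1.0005                                             # geometric "quenching" schedule
    print(w, cost(w))                                              # typically ends near a deep minimum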


However, these advantages have their price. As indicated by (3.13), simulated annealing is very slow, and its generality also means that it is difficult to incorporate knowledge about the optimization problem at hand into the procedure. In order to limit the computational cost of optimization, one frequently tries to speed up convergence by accelerating the annealing process beyond the schedule given by (3.13), a technique referred to as simulated quenching [61]. Then, however, the ergodicity of the system, i.e. its ability to reach all of parameter space, is lost, and the convergence guarantee no longer holds. In the following section, deterministic annealing is introduced as an alternative to simulated annealing for a limited class of problems.

3.2 Deterministic Annealing and EM

3.2.1 Deterministic Annealing

The slow convergence of simulated annealing results from its stochastic search in parameter space, which is costly in computational terms. However, in many optimization problems one is really interested only in the minimum values of the cost function and not in an approximation of the resulting Gibbs distribution. Also, from the transition rules (3.8) it is clear that the distribution over parameter space at a given value of $\beta$ is going to be a Gibbs distribution anyway. Hence, the idea of deterministic annealing [103] is to calculate the Gibbs distribution over parameter space directly and to find its minimum. This minimum is then tracked from low to high values of $\beta$, where it is hoped to coincide with a good minimum of the original cost function.

Deterministic annealing has been applied to a wide variety of optimization problems, including deformable models and matching problems [130, 132, 131, 36, 37, 114, 115], non-metric multidimensional scaling [55, 67], supervised learning for classification and regression [88, 98], the travelling salesman problem [30, 29, 105, 52], and data clustering [103, 104, 15, 16, 14, 56, 59, 57, 58, 54, 86, 87], the application on which this work is going to focus.

The technique of deterministic annealing only works for a particular class of cost functions, because it is necessary to calculate, or at least approximate, the expression for the partition function $Z$ or, equivalently, the free energy $F$. Therefore, this section will focus on cost functions of the form
$$ E(\mathbf{M},\mathbf{w}) = \sum_i\sum_r m_{ir}\,e_{ir}(\mathbf{w}), \qquad (3.14) $$
where, as in Chapter 2, $\mathbf{M}$ is a binary stochastic matrix with $\sum_r m_{ir} = 1,\ \forall i$, and $\mathbf{w}$ is a vector of real-valued parameters. The cost function may also depend on data $\mathcal{X}$, but since these are given and constant, they will not be mentioned in the list of parameters. If we assume no extra knowledge about the problem at hand, we are led to the application of the principle of maximum entropy inference for the derivation of a probability distribution over parameter space. This principle, which was advocated in the statistics framework by Jaynes [62] and Tikochinsky et al. [119], states that for a given average cost $U(\mathbf{w})$,
$$ U(\mathbf{w}) = \sum_{\{\mathbf{M}\}} E(\mathbf{M},\mathbf{w})\,P(\mathbf{M},\mathbf{w}), \qquad (3.15) $$
the probability distribution $P(\mathbf{M},\mathbf{w})$ that is maximally non-committal with respect to missing data is obtained by maximizing the entropy $S$,
$$ S(\mathbf{w}) = -\sum_{\{\mathbf{M}\}} P(\mathbf{M},\mathbf{w})\,\ln P(\mathbf{M},\mathbf{w}), \qquad (3.16) $$


under the additional constraint
$$ \sum_{\{\mathbf{M}\}} P(\mathbf{M},\mathbf{w}) = 1, \qquad (3.17) $$
where the summations in the last three equations run over all admissible states $\{\mathbf{M}\}$. It is also known that the probability distribution derived from the maximum entropy principle has maximum stability in terms of the $L_2$ norm if the temperature parameter $\beta$ or, equivalently, the average cost $U$ is changed [119]. Another argument is obtained from information geometry: given a cost function, the corresponding family of Gibbs distributions parameterized by $\beta$ forms a trajectory in the space of probability distributions which has minimal length [24]. These properties ensure maximal robustness w.r.t. noise and make the principle of maximum entropy the natural choice for a stochastic optimization scheme. The maximum entropy principle leads to the well-known Gibbs distribution which was already introduced in the context of simulated annealing, (3.11) and (3.12),
$$ P(\mathbf{M},\mathbf{w}) = \frac{1}{Z(\mathbf{w})}\exp\left(-\beta E(\mathbf{M},\mathbf{w})\right), \qquad (3.18) $$
with the partition function $Z(\mathbf{w})$ given by
$$ Z(\mathbf{w}) = \sum_{\{\mathbf{M}\}}\exp\left(-\beta E(\mathbf{M},\mathbf{w})\right). \qquad (3.19) $$
From the partition function the free energy $F$ can be calculated as
$$ F(\mathbf{w}) = -\frac{1}{\beta}\ln Z(\mathbf{w}) = -\frac{1}{\beta}\ln\sum_{\{\mathbf{M}\}}\exp\left(-\beta E(\mathbf{M},\mathbf{w})\right). \qquad (3.20) $$
The idea of deterministic annealing is to assume thermodynamic equilibrium and to consider $F(\mathbf{w})$ at a given value of $\beta$ as the effective cost function to be minimized. In the context of optimization this is motivated by the fact that $F(\mathbf{w})$ is a smoothed version of the original cost function, which is recovered in the limit $\beta\to\infty$ with optimal values $\mathbf{M}^*$. The sum over $\{\mathbf{M}\}$ is in this case dominated by $\exp(-\beta E(\mathbf{M}^*,\mathbf{w}))$ with $\mathbf{M}^* = \arg\min_{\mathbf{M}} E(\mathbf{M},\mathbf{w})$, and the free energy in the limit $\beta\to\infty$ can be written as
$$ \lim_{\beta\to\infty} F(\mathbf{w}) = E(\mathbf{M}^*,\mathbf{w}). \qquad (3.21) $$
The parameter vector $\mathbf{w}$ that minimizes the free energy $F(\mathbf{w})$ satisfies the conditions
$$ \frac{\partial F(\mathbf{w})}{\partial\mathbf{w}} = 0. \qquad (3.22) $$
As is often the case in statistical physics, the solution of this problem depends on the calculation of the partition function (3.19). Assuming a cost function as given in (3.14), the partition function (3.19) can be calculated according to [26]
$$ Z(\mathbf{w}) = \sum_{\{\mathbf{M}\}}\exp\left(-\beta\sum_i\sum_r m_{ir}\,e_{ir}(\mathbf{w})\right) = \prod_i\sum_r\exp\left(-\beta\,e_{ir}(\mathbf{w})\right), \qquad (3.23) $$
which results in a convenient expression for $F(\mathbf{w})$:
$$ F(\mathbf{w}) = -\frac{1}{\beta}\sum_i\ln\sum_r\exp\left(-\beta\,e_{ir}(\mathbf{w})\right). \qquad (3.24) $$
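To make (3.23) and (3.24) concrete, the following sketch (an illustration, not code from the thesis) evaluates the free energy of a cost function of the form (3.14) for given partial assignment costs $e_{ir}$; a log-sum-exp formulation keeps the computation numerically stable at large $\beta$. The squared-distance costs used in the example are an assumption for demonstration.

```python
import numpy as np
from scipy.special import logsumexp

# Illustrative sketch: free energy (3.24) for a cost function of the form (3.14).
# e_ir is a (D x N) array of partial assignment costs e_ir(w) for D data items
# and N clusters; beta is the inverse temperature.

def free_energy(e_ir, beta):
    # F(w) = -(1/beta) * sum_i log sum_r exp(-beta * e_ir)
    return -np.sum(logsumexp(-beta * e_ir, axis=1)) / beta

# Example: squared-distance costs e_ir = 0.5*||x_i - w_r||^2 for random data.
rng = np.random.default_rng(0)
X, W = rng.normal(size=(100, 2)), rng.normal(size=(4, 2))
e_ir = 0.5 * np.sum((X[:, None, :] - W[None, :, :])**2, axis=2)
for beta in (0.1, 1.0, 10.0, 100.0):
    print(beta, free_energy(e_ir, beta))   # approaches E(M*, w) as beta grows, cf. (3.21)
```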


The optimality criterion for $\mathbf{w}$ from (3.22) can then be given as
$$ \sum_i\sum_r P(m_{ir}=1)\,\frac{\partial e_{ir}(\mathbf{w})}{\partial\mathbf{w}} = 0, \qquad (3.25) $$
where
$$ P(m_{ir}=1) = \frac{\exp\left(-\beta\,e_{ir}(\mathbf{w})\right)}{\sum_r\exp\left(-\beta\,e_{ir}(\mathbf{w})\right)} \qquad (3.26) $$
is the probability of the assignment $m_{ir}=1$. In the context of clustering this probability will be interpreted as $P(i\in C_r)$, the probability of membership of data item $i$ in cluster $C_r$. Also note that $\langle m_{ir}\rangle = P(m_{ir}=1)$, where $\langle m_{ir}\rangle$ is the expectation value of the binary variable $m_{ir}$ taken w.r.t. the Gibbs distribution (3.18).

Deterministic annealing proceeds as follows. Starting from low values of $\beta\approx 0$, the free energy $F(\mathbf{w})$ (3.20) is minimized or, equivalently, (3.25) is solved using some standard local minimization procedure such as gradient descent or the EM algorithm described later in this chapter. Subsequently, $\beta$ is increased according to an annealing schedule, and the resulting free energy is again minimized, starting from the parameter values $\mathbf{w}$ obtained at the previous value of $\beta$. Hence, the minimum of $F(\mathbf{w})$ found at low values of $\beta$ is tracked to high values of $\beta$, where the free energy $F(\mathbf{w})$ coincides with $E(\mathbf{M}^*,\mathbf{w})$, see (3.21). The minimum found should then coincide with a good, i.e. low-cost, minimum of the original cost function, or even with its global minimum. Convergence to a (one-change optimal) local minimum has been established by Puzicha et al. [97], who also point out that convergence to the global minimum should not be expected in general.

Figure 3.1: Plot of the data space of a toy example consisting of three data points at $\{1, 3.8, 7.5\}$ (crosses) in one dimension, and two cluster centers $\{w_1, w_2\}$ (circles). Shown are the positions of the cluster centers for a) the global and b) the local minimum of the cost function, which can also be seen in the plot of the cost surface in Figure 3.2.

To illustrate this procedure, consider the following toy example of a cost function, which is actually a special case of $E_{\mathrm{TVQ}}$ from the previous chapter, where the coupling of clusters has been left out,
$$ E(\mathbf{M},\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{D}\sum_{r=1}^{N} m_{ir}\,\|x_i - w_r\|^2. \qquad (3.27) $$
For $D = 3$ data points $x_i$ at $\{1, 3.8, 7.5\}$ and $N = 2$ clusters with cluster centers $w_r$, the cost surface has a local and a global minimum (neglecting interchange symmetry), as illustrated in Figure 3.1. The resulting cost function is shown in Figure 3.2 a). The annealing process starts by minimizing the free energy at a low value of $\beta$ (Figure 3.2 b)), where only one minimum exists. This minimum is then tracked through intermediate values of $\beta$ (Figure 3.2 c)) until


the free energy nearly coincides with the original cost function at high values of $\beta$ (Figure 3.2 d)).

The success of deterministic annealing depends, of course, on the annealing schedule for $\beta$. On the one hand, one would like to increase $\beta$ as quickly as possible to save computation time; on the other hand, if $\beta$ is increased at a high rate, there is the risk of "losing" the minimum of $F$ from one $\beta$-step to the next. In practice, exponential annealing schemes with $\beta^{(t+1)} = \eta\,\beta^{(t)}$ have been found effective, where $\eta$ was chosen between 1.1 and 2.

3.2.2 EM Algorithm

Deterministic annealing reduces the problem of finding a global optimum of a cost function $E$ to the solution of multiple local optimization problems on a family of free energy functions $F_\beta$ parameterized by $\beta$. In the present work this local optimization is achieved by using an expectation-maximization (EM) algorithm [25].

The EM algorithm is a well-established way of determining maximum likelihood parameter estimates in problems with unobserved or missing data [99, 83]. It has been applied to a wide variety of problems, including latent variable density models [8, 106, 33], supervised learning from incomplete data [35], hierarchical mixtures of experts [65, 64], Boltzmann machines [126], various problems from traditional statistics [83], and many of the applications mentioned in the previous section on deterministic annealing. Since the seminal paper by Dempster et al. on the EM algorithm, the method has been related to information geometry [3, 2] and statistical physics [91, 132, 27], and more about its convergence properties has become known [128, 129]. First, the algorithm will be introduced in the general statistics framework, and then its relation to the optimization problem at hand will be pointed out.

Suppose there are observed data $\mathcal{X}$, missing data $\mathcal{M}$, and a parameter vector $\mathbf{w}$. The goal is to find the maximum likelihood estimator of $\mathbf{w}$ given the observed data $\mathcal{X}$ and the joint probability $P(\mathcal{X},\mathcal{M},\mathbf{w})$. The EM algorithm starts with an initial guess of the maximum likelihood parameters $\mathbf{w}^{(0)}$ and then iteratively generates successive estimates $\mathbf{w}^{(t)}$ by repeating the following two steps:

E step: Compute the probability $\tilde P^{(t)}$ over the missing values $\mathcal{M}$ such that $\tilde P^{(t)}(\mathcal{M}) = P(\mathcal{M}|\mathcal{X},\mathbf{w}^{(t-1)})$.

M step: Find $\mathbf{w} = \mathbf{w}^{(t)}$ such that $\langle\ln P(\mathcal{X},\mathcal{M},\mathbf{w})\rangle_{\tilde P^{(t)}}$ is maximized.

Thus the EM algorithm repeatedly estimates the probability distribution over the missing data based on the previous parameter estimate (E step) and then performs maximum likelihood estimation (M step) for the joint data, where the expectation $\langle\cdot\rangle$ is taken w.r.t. the previous estimate of the probability distribution over the missing data. This scheme was shown to converge in the sense that each step increases the log-likelihood $\ln P(\mathcal{X},\mathbf{w})$ unless it is already at a local maximum [25].

The relation to the statistical physics framework for optimization is given by the identity
$$ F(\mathbf{w}) = U(\mathbf{w}) - \frac{1}{\beta}S(\mathbf{w}). \qquad (3.28) $$
This leads to a formulation of the EM algorithm [91] in terms of the variational free energy $\tilde F(\tilde P,\mathbf{w})$,
$$ \tilde F(\tilde P,\mathbf{w}) = -\left\langle\ln P(\mathcal{X},\mathcal{M},\mathbf{w})\right\rangle_{\tilde P} - S(\tilde P), \qquad (3.29) $$


where $S(\tilde P)$ is the entropy of the distribution $\tilde P$, given by $S(\tilde P) = -\langle\ln\tilde P\rangle_{\tilde P}$. Now we can reformulate the standard EM algorithm in terms of the variational free energy as follows:

E step: Find $\tilde P = \tilde P^{(t)}$ such that $\tilde F(\tilde P,\mathbf{w}^{(t-1)})$ is minimized.

M step: Find $\mathbf{w} = \mathbf{w}^{(t)}$ such that $\tilde F(\tilde P^{(t)},\mathbf{w})$ is minimized.

By minimizing a common quantity $\tilde F(\tilde P,\mathbf{w})$, this formulation brings out the symmetry of the EM algorithm w.r.t. the optimization of the parameters on one side and the distribution over missing data on the other. At a given temperature $\beta$ this algorithm can be used to minimize the free energy in the deterministic annealing procedure. For the cost functions derived in Chapter 2 this will, essentially, lead to the development of different topographic clustering algorithms in the subsequent chapters.
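As a concrete illustration of how deterministic annealing and EM work together, the following toy sketch applies the scheme to the uncoupled cost function (3.27) with the three data points $\{1, 3.8, 7.5\}$ and two cluster centers from Figures 3.1 and 3.2. It is an assumption-laden illustration, not code from the thesis; the initialization, schedule, and iteration counts are arbitrary choices.

```python
import numpy as np

# Toy illustration (not from the thesis): deterministic annealing with EM for the
# uncoupled cost function (3.27), i.e. soft clustering without neighborhood coupling.

def e_step(X, W, beta):
    # Assignment probabilities (3.26) with e_ir = 0.5*(x_i - w_r)^2.
    e = 0.5 * (X[:, None] - W[None, :])**2
    p = np.exp(-beta * (e - e.min(axis=1, keepdims=True)))   # stabilized softmax
    return p / p.sum(axis=1, keepdims=True)

def m_step(X, P):
    # Stationarity condition (3.25): centers are probability-weighted means.
    return (P * X[:, None]).sum(axis=0) / P.sum(axis=0)

X = np.array([1.0, 3.8, 7.5])
W = np.array([4.09, 4.11])           # near the center of mass, slightly perturbed
beta = 0.02
while beta < 100.0:                  # exponential annealing schedule
    for _ in range(100):             # EM iterations at fixed beta
        W = m_step(X, e_step(X, W, beta))
    beta *= 1.5
print(W)   # typically one center near 7.5, the other near 2.4 (the global minimum of Fig. 3.1 a))
```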


Figure 3.2: The effect of $\beta$ on the optimization problem, illustrated for the toy problem from Figure 3.1. Plot a) shows the iso-cost lines in the parameter space $\{w_1, w_2\}$. Each half (due to interchange symmetry) exhibits a local and a global optimum. Plot b) shows the free energy for a small value of $\beta = 0.02$. The landscape is smoothed out and there is only one global optimum at the center of mass of the data. Plot c) shows the free energy for an intermediate value of $\beta = 0.23$. The symmetry is broken and the global optimum (in each half) has moved away from the center of mass of the data. Plot d) shows the free energy for a large value of $\beta = 1$, where it has almost assumed the form of the original cost function. The dashed lines show the position of the global minimum (in each half) of the original cost function.


Chapter 4

Soft Topographic Vector Quantization

4.1 Derivation of the STVQ-Algorithm

4.1.1 Stationarity Conditions

Consider the cost function $E_{\mathrm{TVQ}}$ derived in Chapter 2 for the squared Euclidean distance distortion,
$$ E_{\mathrm{TVQ}}(\mathbf{M},\mathbf{w}) = \sum_{i=1}^{D}\sum_{r=1}^{N} m_{ir}\,e_{ir}, \qquad (4.1) $$
with partial assignment costs
$$ e_{ir} = \frac{1}{2}\sum_{s=1}^{N} h_{rs}\,\|\mathbf{x}_i - \mathbf{w}_s\|^2, \qquad (4.2) $$
and constraints
$$ \sum_r m_{ir} = 1, \quad \forall i, \qquad (4.3) $$
and
$$ \sum_s h_{rs} = 1, \quad \forall r. \qquad (4.4) $$
This cost function has, of course, exactly the form assumed in the derivation of the expressions for deterministic annealing in Chapter 3. Applying the principle of maximum entropy, a Gibbs distribution is obtained whose partition function, which involves the sum over all possible states $\{\mathbf{M}\}$, leads to the free energy
$$ F(\mathbf{w}) = -\frac{1}{\beta}\ln Z(\mathbf{w}) = -\frac{1}{\beta}\ln\sum_{\{\mathbf{M}\}}\exp\left(-\beta E(\mathbf{M},\mathbf{w})\right). \qquad (4.5) $$
Then, using the fact from (3.23) that the partition function $Z(\mathbf{w})$ factorizes, the stationarity conditions for the cluster centers $\mathbf{w}_r$ at a given value of the temperature parameter $\beta$ are
$$ \frac{\partial F(\mathbf{w})}{\partial\mathbf{w}_r} = \sum_i\sum_r P(m_{ir}=1)\,\frac{\partial e_{ir}(\mathbf{w})}{\partial\mathbf{w}_r} = 0, \quad \forall r. \qquad (4.6) $$


Inserting the partial assignment costs $e_{ir}$ from (4.2) and solving for $\mathbf{w}_r$ yields the following expression,
$$ \mathbf{w}_r = \frac{\sum_i \mathbf{x}_i\sum_s h_{rs}\,P(i\in C_s)}{\sum_i\sum_s h_{rs}\,P(i\in C_s)}, \quad \forall r, \qquad (4.7) $$
where $P(i\in C_s)$ is the assignment probability of data point $\mathbf{x}_i$ to cluster $C_s$ and is given by
$$ P(i\in C_s) = \langle m_{is}\rangle = \frac{\exp\left(-\frac{\beta}{2}\sum_t h_{st}\,\|\mathbf{x}_i-\mathbf{w}_t\|^2\right)}{\sum_u\exp\left(-\frac{\beta}{2}\sum_t h_{ut}\,\|\mathbf{x}_i-\mathbf{w}_t\|^2\right)}, \qquad (4.8) $$
where $\langle m_{is}\rangle$ is the expectation value of the binary assignment variable $m_{is}$ for a given set $\{\mathbf{w}_r\}$ w.r.t. the Gibbs distribution (3.18). Equation (4.7) can be interpreted as a generalized centroid condition or weighted mean of the data vectors. The "optimal" cluster centers $\{\mathbf{w}_r^*\}$ are positioned such that they represent the average of the data vectors assigned to them, where the clusters are weighted by the neighborhood or transition matrix $\mathbf{H}$ and the data vectors are weighted by their assignment probabilities, which in turn depend on the cluster centers $\mathbf{w}_r$ as seen from (4.8). The assignment probabilities $P(i\in C_s)$ can be described as soft-min functions w.r.t. the partial assignment costs $e_{ir}$. The fact that the $P(i\in C_s)$ can be calculated separately for each data vector is a consequence of the factorial form of the partition function $Z(\mathbf{w})$. It means that assignments of data points to clusters are independent given the cluster centers.
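The two coupled equations (4.7) and (4.8) translate directly into an E step and an M step. The following sketch is an illustration of how they can be computed, under the assumption of a dense neighborhood matrix H with rows summing to one; it is not code from the thesis.

```python
import numpy as np

# Illustrative sketch of the STVQ fixed-point equations. X is (D, d) data,
# W is (N, d) cluster centers, H is the (N, N) neighborhood/transition matrix
# with rows summing to one, beta the inverse temperature.

def assignment_probs(X, W, H, beta):
    # E step, Eq. (4.8): soft-min over the h-convolved squared distances.
    d2 = np.sum((X[:, None, :] - W[None, :, :])**2, axis=2)   # (D, N): ||x_i - w_t||^2
    e = 0.5 * d2 @ H.T                                        # (D, N): e_is = 0.5*sum_t h_st d2_it
    a = -beta * (e - e.min(axis=1, keepdims=True))            # stabilized exponent
    P = np.exp(a)
    return P / P.sum(axis=1, keepdims=True)                   # P(i in C_s)

def update_centers(X, P, H):
    # M step, Eq. (4.7): H-convolved, probability-weighted means of the data.
    Q = P @ H.T                                               # (D, N): sum_s h_rs P(i in C_s)
    return (Q.T @ X) / Q.sum(axis=0)[:, None]
```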


4.1.2 Deterministic Annealing and the EM Algorithm

Following the prescriptions from Chapter 3, deterministic annealing can now be applied to the problem of determining the optimal cluster centers $\mathbf{w}_r^*$. The resulting algorithm is called soft topographic vector quantization (STVQ), and the following pseudo-code gives an overview of the procedure.

Soft Topographic Vector Quantization (STVQ)
  initialize $\mathbf{w}_r \leftarrow \langle\mathbf{x}_i\rangle + \mathbf{n}_r,\ \forall r$, with $\mathbf{n}_r\sim\mathcal{N}(0,\sigma^2)$, $\sigma^2$ small
  calculate lookup table for $h_{rs}$
  choose $\beta_{\mathrm{start}}$, $\beta_{\mathrm{final}}$, annealing factor $\eta$, and convergence criterion $\epsilon$
  $\beta \leftarrow \beta_{\mathrm{start}}$
  while $\beta < \beta_{\mathrm{final}}$   (annealing)
    repeat   (EM)
      E step: calculate $P(i\in C_r),\ \forall i, r$, using Eq. (4.8)
      M step: calculate $\mathbf{w}_r^{\mathrm{new}},\ \forall r$, using Eq. (4.7)
    until $\|\mathbf{w}_r^{\mathrm{new}} - \mathbf{w}_r^{\mathrm{old}}\| < \epsilon,\ \forall r$
    $\beta \leftarrow \eta\,\beta$
  end

Although the temperature parameter $\beta$ was introduced to control the annealing procedure and thus to find good solutions of the original cost function (4.1), this parameter can be given another interpretation. In analogy to Gaussian mixture models, the parameter $\beta$ can also be interpreted as an inverse variance in data space, thus determining the resolution of the clustering. Consequently, the annealing process corresponds to a stepwise refinement of the representation of the data, and it is possible to determine the resolution of the final representation by terminating the annealing schedule at an appropriate value of $\beta$. This is particularly appropriate to avoid over-fitting of the data in the presence of noise. As will be seen in Chapter 5, the increase of $\beta$ gives rise to a phase transition, i.e. a split of the cluster centers, which is related to the maximal variance of the data. This phenomenon supports the idea of thinking of $\beta$ as a scale parameter. In practice, when one is interested in the solution for high values of $\beta$, it is convenient to choose $\beta_{\mathrm{start}}\lesssim\beta^*$, where $\beta^*$ is the critical value of $\beta$ for the initial phase transition as calculated in (5.7), in order to avoid wasting computation time before the initial phase transition occurs.
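A runnable counterpart of this pseudo-code, reusing the `assignment_probs` and `update_centers` sketches given above, might look as follows. It is an illustration under assumed default parameter values, not the implementation used for the experiments in this thesis.

```python
import numpy as np

# Illustrative driver for the STVQ pseudo-code above; assignment_probs and
# update_centers are the sketches given after Eq. (4.8).

def stvq(X, H, beta_start=1e-3, beta_final=1e2, eta=1.5, eps=1e-5, rng=None):
    rng = rng or np.random.default_rng(0)
    N = H.shape[0]
    # Initialize all centers at the data mean plus small noise.
    W = X.mean(axis=0) + 1e-3 * rng.normal(size=(N, X.shape[1]))
    beta = beta_start
    while beta < beta_final:                     # annealing loop
        while True:                              # EM loop at fixed beta
            P = assignment_probs(X, W, H, beta)  # E step, Eq. (4.8)
            W_new = update_centers(X, P, H)      # M step, Eq. (4.7)
            converged = np.max(np.linalg.norm(W_new - W, axis=1)) < eps
            W = W_new
            if converged:
                break
        beta *= eta                              # exponential schedule
    return W
```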


4.1.3 Derivatives of the STVQ-Algorithm

To put the above derived algorithm (STVQ) into a familiar context, let us consider certain limits and approximations which lead to a family of topographic clustering algorithms.

The limiting case $\beta\to\infty$ in the assignment probabilities (4.8) yields a batch version of the topographic vector quantizer (TVQ) discussed by Luttrell [79] and Heskes et al. [51]. TVQ is a winner-take-all algorithm, for which (4.7) and (4.8) become
$$ \mathbf{w}_r = \frac{\sum_i \mathbf{x}_i\sum_s h_{rs}\,P_{\mathrm{TVQ}}(i\in C_s)}{\sum_i\sum_s h_{rs}\,P_{\mathrm{TVQ}}(i\in C_s)}, \quad \forall r, \qquad (4.9) $$
and
$$ P_{\mathrm{TVQ}}(i\in C_s) = \delta_{st}, \qquad t = \arg\min_u\sum_v h_{uv}\,\|\mathbf{x}_i-\mathbf{w}_v\|^2. \qquad (4.10) $$
The approximation $h_{rs}\to\delta_{rs}$ in the assignment probabilities (4.8) leads to a new probabilistic version of the SOM, which will be called soft-SOM (SSOM). This modification provides an important computational simplification, because the omission of one convolution with $h_{rs}$ saves a considerable amount of computation time. Equations (4.7) and (4.8) then become
$$ \mathbf{w}_r = \frac{\sum_i \mathbf{x}_i\sum_s h_{rs}\,P_{\mathrm{SSOM}}(i\in C_s)}{\sum_i\sum_s h_{rs}\,P_{\mathrm{SSOM}}(i\in C_s)}, \quad \forall r, \qquad (4.11) $$
and
$$ P_{\mathrm{SSOM}}(i\in C_s) = \frac{\exp\left(-\frac{\beta}{2}\|\mathbf{x}_i-\mathbf{w}_s\|^2\right)}{\sum_t\exp\left(-\frac{\beta}{2}\|\mathbf{x}_i-\mathbf{w}_t\|^2\right)}. \qquad (4.12) $$
It has been noted by Luttrell [79, 76], however, that (4.11) and (4.12), which correspond to a nearest-neighbor encoding, do not in general minimize the cost function (4.1). An exact minimization is only achieved when the non-zero transition probabilities are taken into account not only in the update rule but also in the determination of the winner, as in (4.7) and (4.8) for STVQ.

If one combines the limiting case $\beta\to\infty$ with the approximation $h_{rs}\to\delta_{rs}$ in (4.8), one obtains a batch version of the SOM [89], for which Kohonen's original algorithm [70, 69] is a stochastic approximation [102] as discussed in Chapter 3. Equations (4.7) and (4.8) then become
$$ \mathbf{w}_r = \frac{\sum_i \mathbf{x}_i\sum_s h_{rs}\,P_{\mathrm{SOM}}(i\in C_s)}{\sum_i\sum_s h_{rs}\,P_{\mathrm{SOM}}(i\in C_s)}, \quad \forall r, \qquad (4.13) $$
and
$$ P_{\mathrm{SOM}}(i\in C_s) = \delta_{st}, \qquad t = \arg\min_u\|\mathbf{x}_i-\mathbf{w}_u\|^2. \qquad (4.14) $$
Finally, substituting $h_{rs}\to\delta_{rs}$ in both (4.7) and (4.8) yields the soft clustering procedure (SC) proposed by Rose et al. [103], whose limit $\beta\to\infty$ recovers the well-known LBG algorithm (HC) [74] or, in the online version, k-means clustering [81]. Figure 4.1 summarizes the family of topographic clustering algorithms.

Figure 4.1: The STVQ-family of clustering algorithms.

4.2 STVQ for Noisy Source Channel Coding

4.2.1 Transmission via a Noisy Channel

In order to demonstrate the applicability of STVQ, and in particular of SSOM, to source channel coding, both algorithms were applied to the compression of image data, which were then sent via a noisy channel model and decoded after transmission. The data transmission scenario can be seen in Figure 4.2, which depicts the process of encoding, noisy transmission, decoding, and the distortion measure used in STVQ. The channel model was a binary symmetric channel, which is characterized by a bit error rate (BER) $\varepsilon$ and the number of bits $n$. The transition probabilities $h_{rs}$ can then be expressed in terms of the Hamming distance $d_H(r,s)$ between the $n$-bit binary representations of $r$ and $s$,
$$ h_{rs} = (1-\varepsilon)^{\,n-d_H(r,s)}\,\varepsilon^{\,d_H(r,s)}. \qquad (4.15) $$


As a training set, three 512 x 512 pixel, 256 gray-value images were used, which were taken from different scenes. The images were split into blocks of size d = 2 x 2 for encoding. The size of the codebook was chosen to be N = 16 in order to achieve a compression to 1 bit per pixel (bpp). We applied an exponential annealing schedule with $\eta = 2$ and determined the start value $\beta_{\mathrm{start}}$ to be just below the critical $\beta^*$ for the first split, as will be calculated in Chapter 5, equation (5.7). Note that with the transition matrix as given in (4.15) this optimization corresponds to the embedding of an n = 4 dimensional hypercube in the d = 4 dimensional data space. The resulting codebooks were tested by encoding the test image Lena (Figure 4.4), which had not been used for determining the codebook, simulating the transmission of the indices via a noisy binary symmetric channel with given bit error rate, and reconstructing the image using the codebook.

Figure 4.2: Cartoon of the source channel coding procedure for data communication. Input data $\mathbf{x}_i$ are grouped and the groups (clusters) are labeled with indices $r$ (encoding stage). The indices are then transmitted via a noisy channel which is characterized by a set of transition probabilities $h_{rs}$ for the noise process. As soon as an index $s$ is received at the decoder, the data is reconstructed via a vector $\mathbf{w}_s$ (decoding stage) which represents all data points assigned to cluster $s$ during encoding. In STVQ the combined error due to clustering and channel, measured as the squared Euclidean distance between the original data vector $\mathbf{x}_i$ and the cluster center $\mathbf{w}_s$, is minimized, averaged over all transitions $r\to s$.

4.2.2 Results

The results of the source channel coding experiments are summarized in Figure 4.3, which shows a plot of the signal-to-noise ratio per pixel (PSNR) as a function of the bit error rate for STVQ (diamonds), SSOM (vertical crosses), and LBG (oblique crosses). STVQ shows the best performance, especially for high BERs, where it is naturally far superior to the LBG algorithm, which does not take channel noise into account. SSOM, however, performs only slightly worse (approx. 1 dB) than STVQ. Considering the fact that SSOM is computationally much less demanding than STVQ ($\mathcal{O}(N)$ for encoding), due to the omission of the convolution with $h_{rs}$ in (4.8), the result demonstrates the efficiency of SSOM for source channel coding.


Figure 4.3: Comparison between different vector quantizers for image compression, noisy channel (BSC) transmission, and reconstruction. The plot shows the signal-to-noise ratio per pixel (PSNR), defined as $10\log_{10}(\sigma_{\mathrm{signal}}/\sigma_{\mathrm{noise}})$, as a function of bit error rate (BER) for STVQ and SSOM, each optimized for the given channel noise, for SSOM optimized for a BER of 0.05, and for LBG. The training set consisted of three 512 x 512 pixel, 256 gray-value images with block size d = 2 x 2. The codebook size was N = 16, corresponding to 1 bit per pixel (bpp). The annealing schedule was exponential with $\eta = 2$ and the convergence parameter was $\epsilon = 10^{-5}$. Lena was used as a test image.

Figure 4.3 also shows the generalization behavior of an SSOM codebook optimized for a BER of 0.05 (rectangles). Since this codebook was optimized for a BER of 0.05, it performs worse than appropriately trained SSOM codebooks for other values of the BER, but it still performs better than LBG except for low values of the BER. At low values, SSOMs trained for the noisy case are outperformed by LBG because robustness w.r.t. channel noise is achieved at the expense of an optimal data representation in the noise-free case. Figure 4.4, finally, provides a visual impression of the performance of the different vector quantizers at a BER of 0.033. While the reconstruction for STVQ is only slightly better than the one for SSOM, both are clearly superior to the reconstruction for LBG.


Figure 4.4: Lena transmitted over a binary symmetric channel with a BER of 0.033, encoded and reconstructed using different vector quantization algorithms (original; LBG, SNR 4.64 dB; STVQ, SNR 9.00 dB; SSOM, SNR 7.80 dB). The parameter settings were the same as in Figure 4.3.


Chapter 5

Phase Transitions in STVQ

5.1 Initial Phase Transition

In order to understand the annealing process in the temperature parameter $\beta$, it is instructive to look at how the representation of the data changes with $\beta$. From Rose et al. [103] and Buhmann et al. [15] it is known that the cluster centers split with increasing $\beta$ and that the number of relevant clusters for a resolution given by $\beta$ is determined by the number of clusters that have split up to that point. In STVQ, however, the permutation symmetry of the cluster centers is broken and couplings between clusters are introduced by the transition matrix $\mathbf{H}$. This changes the stationary states and the "splitting" behavior of the cluster centers.

For $\beta = 0$, which corresponds to infinite temperature, every data point $\mathbf{x}_i$ is assigned to every cluster $C_r$ with equal probability $P^0(i\in C_r) = 1/N$, where $N$ is the number of cluster centers. In this case the cluster centers are given by
$$ \mathbf{w}_r^0 = \frac{1}{D}\sum_i\mathbf{x}_i, \quad \forall r, \qquad (5.1) $$
that is, all the cluster centers are located at the center of mass of the data. Without loss of generality let $\mathbf{w}_r^0 = 0,\ \forall r$. A Taylor expansion of the r.h.s. of (4.7) around $\{\mathbf{w}_r^0\}$ to first order in $\mathbf{w}_t$ yields
$$ \mathbf{w}_r = \left[\frac{\sum_i\mathbf{x}_i\sum_s h_{rs}P(i\in C_s)}{\sum_i\sum_s h_{rs}P(i\in C_s)}\right]_{\{\mathbf{w}_r^0\}} + \sum_t\left[\frac{\partial}{\partial\mathbf{w}_t}\,\frac{\sum_i\mathbf{x}_i\sum_s h_{rs}P(i\in C_s)}{\sum_i\sum_s h_{rs}P(i\in C_s)}\right]_{\{\mathbf{w}_r^0\}}\mathbf{w}_t + \mathcal{O}(\mathbf{w}_t^2). \qquad (5.2) $$
Under the assumption that $\mathbf{H}$ is symmetric, i.e. $h_{rs} = h_{sr},\ \forall r,s$, this expression can be evaluated using the relation
$$ \frac{\partial P(i\in C_s)}{\partial\mathbf{w}_t} = \beta\,(\mathbf{x}_i-\mathbf{w}_t)\,P(i\in C_s)\left(h_{st} - \sum_u h_{tu}\,P(i\in C_u)\right), \qquad (5.3) $$
and the linearized fixed-point equations become
$$ \mathbf{w}_r = \beta\,\mathbf{C}\sum_t g_{rt}\,\mathbf{w}_t. \qquad (5.4) $$
Here $\mathbf{C} = \frac{1}{D}\sum_i\mathbf{x}_i\mathbf{x}_i^{\mathrm T}$ is the covariance matrix of the data and
$$ g_{rt} = \sum_s h_{rs}\left(h_{st} - \frac{1}{N}\right) \qquad (5.5) $$


are the elements of a matrix $\mathbf{G}$ which acts on the cluster indices. The system of equations (5.4) decouples under transformation to the eigenbasis of the covariance matrix $\mathbf{C}$ in data space and to the eigenbasis of the matrix $\mathbf{G}$ in cluster space. The former transformation is also known as principal component analysis (PCA) [60]. Denoting the transformed cluster centers by $\hat w'_{\alpha k}$, where $\alpha$ and $k$ designate the components in the new bases of data space and cluster space, and prime and hat denote PCA and the transformation to the eigenbasis of $\mathbf{G}$, (5.4) becomes
$$ \hat w'_{\alpha k} = \beta\,\lambda^C_\alpha\,\lambda^G_k\,\hat w'_{\alpha k}, \qquad (5.6) $$
where $\lambda^C_\alpha$ and $\lambda^G_k$ are the eigenvalues for the eigenvectors $\mathbf{v}^C_\alpha$ and $\mathbf{v}^G_k$. Equation (5.6) can only have non-zero solutions for $\beta\,\lambda^C_\alpha\,\lambda^G_k = 1$. Hence, there is a critical $\beta^*$,
$$ \beta^* = \frac{1}{\lambda^C_{\max}\,\lambda^G_{\max}}, \qquad (5.7) $$
at which the center-of-mass solution becomes unstable, clusters split, and a new representation of the data set emerges. $\beta^*$ depends on the data via the largest eigenvalue $\lambda^C_{\max}$ of the covariance matrix $\mathbf{C}$, whose eigenvector $\mathbf{v}^C_{\max}$ denotes the direction of maximum variance $\sigma^2_{\max} = \lambda^C_{\max}$ of the data. Consequently, the split of the clusters occurs along the principal axis in data space. $\beta^*$ also depends on the transition matrix $\mathbf{H}$ via the largest eigenvalue $\lambda^G_{\max}$ of the matrix $\mathbf{G}$. The largest eigenvalue $\lambda^G_{\max}$ indicates which eigenvector $\mathbf{v}^G_k = \mathbf{v}^G_{\max}$ is dominant and therefore determines the direction in cluster space in which the split occurs. Any component $w'_{\alpha r}$ of the vector $\mathbf{w}'_\alpha = (w'_{\alpha 1},\ldots,w'_{\alpha N})^{\mathrm T}$ can be expressed as a linear combination $w'_{\alpha r} = \sum_k\hat w'_{\alpha k}\,v^G_{kr}$ of components $v^G_{kr}$ of the eigenvectors $\mathbf{v}^G_k = (v^G_{k1},\ldots,v^G_{kN})^{\mathrm T}$ of the matrix $\mathbf{G}$. Thus the development of the cluster center component $w'_{\alpha r}$ under the linearized fixed-point equation (5.6) depends on the value of component $r$ of the eigenvector $\mathbf{v}^G_{\max}$. Given the principal axis in data space, the eigenvector $\mathbf{v}^G_{\max}$ indicates in which direction along this axis, as well as how far, each cluster center moves relative to the other cluster centers in the linear approximation.

In order to express this result in terms of eigenvectors $\mathbf{v}^H_k$ and eigenvalues $\lambda^H_k$ of $\mathbf{H}$, it is observed that $\mathbf{G}$ and $\mathbf{H}$ have the same set of eigenvectors. It follows from (5.5) that $\mathbf{v}^G_{\max}$ is identical to the eigenvector of $\mathbf{H}$ which corresponds to its second largest eigenvalue $\lambda^H_k$, with $(\lambda^H_k)^2 = \lambda^G_{\max}$.

The above results can be extended to SSOM, which is based on the fixed-point equations (4.11) and (4.12). For SSOM the matrix $\mathbf{G}$, whose elements are given by (5.5), must simply be replaced by $\mathbf{G}^{\mathrm{SSOM}}$ with elements $g^{\mathrm{SSOM}}_{rt} = h_{rt} - \frac{1}{N}$.
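The critical inverse temperature (5.7) can be computed numerically from the data covariance and the coupling matrix. The following sketch (illustrative only, assuming a symmetric H as in the derivation above) builds G from (5.5) and returns $\beta^*$ together with the split pattern $\mathbf{v}^G_{\max}$ in cluster space.

```python
import numpy as np

# Illustrative sketch: critical inverse temperature (5.7) for STVQ.
# X is (D, d) data, H an (N, N) symmetric transition matrix with rows summing to one.

def critical_beta(X, H):
    N = H.shape[0]
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(X)                        # data covariance matrix
    G = H @ (H - 1.0 / N)                         # elements g_rt from (5.5)
    lam_C = np.linalg.eigvalsh(C).max()           # maximum data variance
    lam_G, vecs = np.linalg.eigh(G)               # G symmetric for symmetric H
    v_G_max = vecs[:, np.argmax(lam_G)]           # split direction in cluster space
    return 1.0 / (lam_C * lam_G.max()), v_G_max   # beta*, Eq. (5.7)
```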


Figure 5.1: The phenomenon of "dimension reduction" and the automatic selection of feature dimensions. States of minimal free energy are shown a) before the phase transition ($\sigma_y = 0.4$) and b) after the transition ($\sigma_y = 1.8$) for a one-dimensional array of N = 128 cluster centers and a two-dimensional data space. The chain of clusters as well as the x-dimension in data space are subject to periodic boundary conditions. The x-direction is referred to as the longitudinal dimension, the y-direction is called the transversal dimension. The dots represent data points and the filled circles the locations $\mathbf{w}_r$ of the cluster centers. Those cluster centers whose labels differ by one are connected by lines. The transition probabilities $h_{rs}$ correspond to a Gaussian neighborhood function of standard deviation $\sigma_h = 5.0$. Parameter values $\beta = 1.3$ and $\Delta = 10.0$ lead to a critical standard deviation $\sigma^*_y = 1.25$ and a critical mode $k^* = 3$ for the transition.

5.2 Automatic Selection of Feature Dimensions

A similar analysis as above can be carried out with regard to the phenomenon of the automatic selection of feature dimensions, a term first used by Kohonen [71] in the context of dimension reduction [28, 94]. Let us consider a d-dimensional data space and an array of clusters labeled by n-dimensional index vectors $\mathbf{r}$. The couplings $h_{\mathbf{rs}}$ of clusters are defined on this array and are typically chosen to be a monotonically decreasing function of $\|\mathbf{r}-\mathbf{s}\|$. For $d > n$ a simple representation of the input data is achieved if the data has significant variance only along n of the d dimensions. In this case, the vectors $\mathbf{w}_{\mathbf{r}}$ lie in an n-dimensional subspace and the excess dimensions are effectively ignored (see Figure 5.1 a)). If, however, the variance


of the data along the excess dimensions surpasses a critical value, the original representation becomes unstable, and the array of vectors $\mathbf{w}_{\mathbf{r}}$ folds into the excess dimensions so as to represent them as well (see Figure 5.1 b)). This phenomenon was studied in a formal way by employing a Fokker-Planck approximation for the dynamics of the (zero temperature) SOM on-line learning algorithm [92, 100]. In the following, an analysis is presented for the full STVQ family by investigating the fixed-point equations (4.7) and (4.8) and comparing the results to the limiting case of the SOM.

5.2.1 Phase Transition in the Discrete Case

For this purpose, let us examine the stability of (4.7) and (4.8) around a known fixed point. Let us consider the case of an infinite number of data points generated by an underlying probability distribution $P(\mathbf{x})$. The fixed-point equations then read
$$ \mathbf{w}_{\mathbf{r}} = \frac{\int P(\mathbf{x})\,\mathbf{x}\sum_{\mathbf{s}} h_{\mathbf{rs}}\,P(\mathbf{x}\in C_{\mathbf{s}})\,d\mathbf{x}}{\int P(\mathbf{x})\sum_{\mathbf{s}} h_{\mathbf{rs}}\,P(\mathbf{x}\in C_{\mathbf{s}})\,d\mathbf{x}}, \quad \forall\mathbf{r}, \qquad (5.8) $$
$$ P(\mathbf{x}\in C_{\mathbf{s}}) = \frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{st}}\,\|\mathbf{x}-\mathbf{w}_{\mathbf{t}}\|^2\right)}{\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{ut}}\,\|\mathbf{x}-\mathbf{w}_{\mathbf{t}}\|^2\right)}, \qquad (5.9) $$
where cluster indices $\mathbf{r}$ are now n-dimensional index vectors which lie on an n-dimensional cubic array, $\mathbf{r}\in\mathbb{N}^n$, $r_\nu\in\{1,2,\ldots,N\}$. For the following, assume that the couplings $h:\mathbb{N}^n\times\mathbb{N}^n\to[0,1]$ obey $h_{\mathbf{rs}} = h_{\|\mathbf{r}-\mathbf{s}\|}$. For notational convenience, the data space $X$ is split into two subspaces, $X = X^\parallel\times X^\perp$, one for the embedding or longitudinal dimensions $X^\parallel$ with elements $\mathbf{x}^\parallel$ and one for the excess or transversal dimensions $X^\perp$ with elements $\mathbf{x}^\perp$. Also assume the probability distribution $P(\mathbf{x})$ over data space $X$ to factorize as $P(\mathbf{x}) = P(\mathbf{x}^\parallel)\,P(\mathbf{x}^\perp)$, where the probability distribution $P(\mathbf{x}^\perp)$ in the transversal dimensions has zero mean, i.e. $\int P(\mathbf{x}^\perp)\,\mathbf{x}^\perp\,d\mathbf{x}^\perp = 0$. In the longitudinal dimensions of data space assume the factorization $P(\mathbf{x}^\parallel) = \prod_\nu P(x^\parallel_\nu)$, with $P(x^\parallel_\nu) = 1$ for $-\frac{\ell}{2}\leq x^\parallel_\nu\leq\frac{\ell}{2}$ and $P(x^\parallel_\nu) = 0$ otherwise, and let us consider the system in the approximation $N\to\infty$, $\ell\to\infty$ with $\Delta := N/\ell$ finite. Since the variance in the longitudinal data space is effectively infinite, one obtains for the fixed point of (5.8) (see Appendix 9.1)
$$ \mathbf{w}^{\parallel\,0}_{\mathbf{r}} = \Delta^{-1}\mathbf{r} \quad\text{and}\quad \mathbf{w}^{\perp\,0}_{\mathbf{r}} = 0, \quad \forall\mathbf{r}. \qquad (5.10) $$
Again (5.8) can be expanded to first order in $\mathbf{w}_{\mathbf{t}}$ around the fixed point $\{\mathbf{w}^0_{\mathbf{r}}\}$, just as in (5.2). The assignment probability $P^0(\mathbf{x}\in C_{\mathbf{s}})$ of a data point $\mathbf{x}$ to a cluster $C_{\mathbf{s}}$ in the fixed-point state (5.10) depends on the longitudinal components of $\mathbf{x}$ only and, abusing notation, one can write $P^0(\mathbf{x}\in C_{\mathbf{s}}) = P^0(\mathbf{x}^\parallel\in C_{\mathbf{s}})$. Let us consider the stability of (5.10) along the transversal dimensions, which determines the critical parameters for the phase transition depicted in Figure 5.1 b). Using
$$ \int P(\mathbf{x})\,\mathbf{x}^\perp\sum_{\mathbf{s}} h_{\mathbf{rs}}\,P^0(\mathbf{x}^\parallel\in C_{\mathbf{s}})\,d\mathbf{x} = 0 $$
(see Appendix 9.1, equation (9.4)), one obtains for the transversal components of the cluster centers $\mathbf{w}^\perp_{\mathbf{r}}$ in the linear approximation
$$ \mathbf{w}^\perp_{\mathbf{r}} = \sum_{\mathbf{t}}\frac{\int P(\mathbf{x})\,\mathbf{x}^\perp\sum_{\mathbf{s}} h_{\mathbf{rs}}\left[\frac{\partial P(\mathbf{x}\in C_{\mathbf{s}})}{\partial\mathbf{w}_{\mathbf{t}}}\right]_{\{\mathbf{w}^0_{\mathbf{r}}\}} d\mathbf{x}}{\int P(\mathbf{x})\sum_{\mathbf{s}} h_{\mathbf{rs}}\,P^0(\mathbf{x}^\parallel\in C_{\mathbf{s}})\,d\mathbf{x}}\;\mathbf{w}_{\mathbf{t}}. \qquad (5.11) $$


The denominator of (5.11) evaluates to $\Delta^{-n}$ (see Appendix 9.1, equation (9.7)), because in the average over data space for the fixed point no cluster is singled out. Inserting (5.3) into (5.11), one obtains
$$ \mathbf{w}^\perp_{\mathbf{r}} = \beta\,\mathbf{C}\sum_{\mathbf{t}}\left(\sum_{\mathbf{s}} h_{\mathbf{rs}}\left(h_{\mathbf{st}} - \sum_{\mathbf{u}} h_{\mathbf{tu}}\,f_{\mathbf{us}}\right)\right)\mathbf{w}^\perp_{\mathbf{t}}, \qquad (5.12) $$
in which $\mathbf{C} = \int P(\mathbf{x}^\perp)\,\mathbf{x}^\perp\,\mathbf{x}^{\perp\mathrm T}\,d\mathbf{x}^\perp$ is the covariance matrix of the transversal dimensions of data space and $f_{\mathbf{us}}$,
$$ f_{\mathbf{us}} = \Delta^n\int P^0(\mathbf{x}^\parallel\in C_{\mathbf{u}})\,P^0(\mathbf{x}^\parallel\in C_{\mathbf{s}})\,d\mathbf{x}^\parallel, \qquad (5.13) $$
is essentially the correlation function of the assignment probabilities of clusters $C_{\mathbf{u}}$ and $C_{\mathbf{s}}$ in the fixed-point state $\{\mathbf{w}^0_{\mathbf{r}}\}$, taken over data space. $f_{\mathbf{us}}$ depends on $\beta$ via the assignment probabilities $P^0(\mathbf{x}^\parallel\in C_{\mathbf{u}})$. Note that (5.12) has the same form as (5.4) when $g_{\mathbf{rt}}$ is taken to be $g_{\mathbf{rt}} = \sum_{\mathbf{s}} h_{\mathbf{rs}}\left(h_{\mathbf{st}} - \sum_{\mathbf{u}} h_{\mathbf{tu}}\,f_{\mathbf{us}}\right)$.

Equations (5.12) can again be decoupled in data space by a transformation to the eigenbasis of $\mathbf{C}$. Denoting the components of the transformed cluster centers by $w'^\perp_{\alpha\mathbf{r}}$, where $\alpha$ is the index with respect to the eigenvector $\mathbf{v}^C_\alpha$ with eigenvalue $\lambda^C_\alpha$, (5.12) reads
$$ w'^\perp_{\alpha\mathbf{r}} = \beta\,\lambda^C_\alpha\sum_{\mathbf{t}}\sum_{\mathbf{s}} h_{\mathbf{rs}}\left(h_{\mathbf{st}} - \sum_{\mathbf{u}} h_{\mathbf{tu}}\,f_{\mathbf{us}}\right)w'^\perp_{\alpha\mathbf{t}}. \qquad (5.14) $$
From $h_{\mathbf{rs}} = h_{\|\mathbf{r}-\mathbf{s}\|}$ it follows that $f_{\mathbf{rs}} = f_{\|\mathbf{r}-\mathbf{s}\|}$ (see Appendix 9.2). Defining the discrete convolution of two lattice functions $a_{\mathbf{r}}$ and $b_{\mathbf{s}}$ to be $(a*b)_{\mathbf{r}} = \sum_{\mathbf{s}} a_{(\mathbf{r}-\mathbf{s})}\,b_{\mathbf{s}}$, (5.14) can be written as
$$ w'^\perp_{\alpha\mathbf{r}} = \beta\,\lambda^C_\alpha\,\bigl(h*(h - h*f)*w'^\perp_\alpha\bigr)_{\mathbf{r}}. \qquad (5.15) $$
Application of the discrete Fourier transform, $\hat a_{\mathbf{k}} = \sum_{\mathbf{r}} a_{\mathbf{r}}\exp\left(i\,\mathbf{k}\cdot\mathbf{r}\right)$, to (5.15) leads to a decoupling of (5.15) in cluster space as well, and one obtains
$$ \hat w'^\perp_{\alpha k} = \beta\,\lambda^C_\alpha\,\hat h_k^2\left(1-\hat f_k\right)\hat w'^\perp_{\alpha k}, \qquad (5.16) $$
where the fact was used that the modes in k-space depend only on the absolute value $k := \|\mathbf{k}\|$ due to the isotropy of the neighborhood function, of the data distribution, and of the fixed-point state. Equation (5.16) can only have non-zero solutions if $\beta\,\lambda^C_\alpha\,\hat h_k^2\left(1-\hat f_k(\beta)\right) = 1$. Since $\lambda^C_\alpha = \sigma^2_\alpha$, where $\sigma^2_\alpha$ is the variance along the $\alpha$-axis in data space, it is clear that the cluster centers will automatically select the direction in transversal data space with maximum variance $\sigma^2_{\max}$. Thus the eigenvector $\mathbf{v}^C_{\max}$ gives the direction in data space in which the array of cluster centers folds first. The critical temperature $\beta^*$ at which this transition occurs is given implicitly by
$$ \sigma^2_{\max}\,\hat h^2_{k^*}\,\beta^*\left(1-\hat f_{k^*}(\beta^*)\right) - 1 = 0, \qquad (5.17) $$
where the critical mode $k^*$ is the mode $k$ for which (5.17) has a solution with minimal $\beta$. For a given $\beta$, an explicit expression for the critical variance $(\sigma^*_{\max})^2$ can be obtained:
$$ (\sigma^*_{\max})^2 = \frac{1}{\beta\,\hat h^2_{k^*}\left(1-\hat f_{k^*}(\beta)\right)}, \qquad (5.18) $$


where
$$ k^* = \arg\max_k\ \hat h_k^2\left(1-\hat f_k(\beta)\right). \qquad (5.19) $$
Very similar results can be derived for the SSOM when the approximation to the E step (4.12) is applied. The resulting equations are identical to (5.17), (5.18), and (5.19), except that $\hat h_k$ is not squared and $\hat f_k(\beta)$ has to be calculated using the approximation given in (4.12).

5.2.2 Continuous Gaussian Case

To analytically determine values for $(\sigma^*_{\max})^2$ and $k^*$ for a given $\beta$ from (5.18) and (5.19), let us choose $h_{\mathbf{rs}}$ Gaussian with variance $\sigma_h^2$ in the distance $\|\mathbf{r}-\mathbf{s}\|$ between clusters $\mathbf{r}$ and $\mathbf{s}$ in the array. Also, a continuum approximation is considered, i.e. all index vectors $\mathbf{r}$ and their associated index vectors in k-space are real, and all functions that were previously defined on $\mathbb{N}^n$ are now defined on the corresponding continuum $\mathbb{R}^n$. Under these conditions $h_{\mathbf{rs}}$ can be expressed as


$$ h_{\mathbf{rs}} \to h(\|\mathbf{r}-\mathbf{s}\|) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_h}\right)^n\exp\left(-\frac{\|\mathbf{r}-\mathbf{s}\|^2}{2\sigma_h^2}\right), \qquad (5.20) $$
where $n$ denotes the dimensionality of the cluster array. Inserting (5.20) into (5.13) and replacing sums by integrals yields (see Appendix 9.3)
$$ f_{\mathbf{rs}} \to f(\|\mathbf{r}-\mathbf{s}\|) = \left(\sqrt{\frac{\beta}{4\pi}}\,\frac{1}{\Delta}\right)^n\exp\left(-\frac{\beta}{4\Delta^2}\,\|\mathbf{r}-\mathbf{s}\|^2\right). \qquad (5.21) $$
Inserting the Fourier transforms of $h_{\|\mathbf{r}-\mathbf{s}\|}$ and $f_{\|\mathbf{r}-\mathbf{s}\|}$ into (5.19), one obtains
$$ (k^*)^2 = \frac{\beta}{\Delta^2}\,\ln\left(1+\frac{\Delta^2}{\beta\,\sigma_h^2}\right) \qquad (5.22) $$
from $\left[\frac{\partial}{\partial k}\,\hat h_k^2\left(1-\hat f_k(\beta)\right)\right]_{k^*} = 0$. Inserting (5.22) into (5.18) finally provides the critical variance $(\sigma^*_{\max})^2$,
$$ (\sigma^*_{\max})^2 = \left(\frac{1}{\beta}+\frac{\sigma_h^2}{\Delta^2}\right)\left(1+\frac{\Delta^2}{\beta\,\sigma_h^2}\right)^{\frac{\beta\,\sigma_h^2}{\Delta^2}}, \qquad (5.23) $$
for the mode $k^*$.

An interesting aspect of (5.23) is that $1/\beta$ and $\sigma_h^2/\Delta^2$ appear to play a very similar role. Interpreting $\beta$ as an inverse variance of the noise in data space, (5.23) is essentially the sum of the variance in data space given by $1/\beta$ and the variance $\sigma_h^2$ of the noise in cluster space scaled to data space by a factor $\Delta^{-2}$.

The above results are also valid in the case $\beta\to\infty$, which corresponds to the TVQ given in (4.9) and (4.10). From (5.22) and (5.23) one obtains
$$ \lim_{\beta\to\infty}(k^*)^2 = \frac{1}{\sigma_h^2}, \qquad (5.24) $$
$$ \lim_{\beta\to\infty}(\sigma^*_{\max})^2 = \sigma_h^2\,e\,\Delta^{-2}. \qquad (5.25) $$
Equation (5.24) shows that high values of $\sigma_h^2$, i.e. long-ranged coupling between clusters, suppress high transversal modes. From (5.25) it can be seen that the critical variance $(\sigma^*_{\max})^2$ is proportional to the variance $\sigma_h^2$ of the neighborhood function, scaled to data space by a factor $\Delta^{-2}$. Thus the stability of the fixed-point state $\{\mathbf{w}^{\perp\,0}_{\mathbf{r}}\}$ w.r.t. the variance of the data along the transversal direction in data space can be adjusted by changing $\sigma_h^2$.

All the above results carry over to the SOM version of the algorithm, (4.11) and (4.12), if $\sigma_h^2$ is replaced by $\sigma_h'^2/2$ in (5.22) to (5.25), where $\sigma_h'^2$ denotes the variance of the SOM neighborhood function. For the wavelength $\lambda^*$ of the critical mode one obtains ($\Delta = 1$)
$$ \lambda^* = \frac{2\pi}{k^*} = \sigma'_h\,\pi\sqrt{2} \approx 4.44\,\sigma'_h. \qquad (5.26) $$
If the critical variance $(\sigma^*_{\max})^2$ is expressed in terms of the half width $s^*$ of a homogeneous data distribution, one obtains
$$ s^* = \sigma'_h\sqrt{3e/2} \approx 2.02\,\sigma'_h. \qquad (5.27) $$
The last two results, (5.26) and (5.27), are identical to those presented by Ritter and Schulten [100] for the on-line version of Kohonen's SOM algorithm with a Gaussian neighborhood function, using the Fokker-Planck approach.
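Formulas (5.22) and (5.23) are easy to evaluate numerically. The short sketch below (an illustration only) uses the parameter values $\beta = 1.3$, $\sigma_h = 5.0$ and $\Delta = 10.0$ quoted with Figure 5.1 and the N = 128 chain of Section 5.3.3; it yields a critical lattice mode near $k^* = 3$ and a critical standard deviation of about 1.27, in line with the values reported below.

```python
import numpy as np

# Illustrative evaluation of the continuum results (5.22) and (5.23)
# for the parameter values used in the chain experiments of Section 5.3.3.
beta, sigma_h, Delta, N = 1.3, 5.0, 10.0, 128

k_star = np.sqrt(beta / Delta**2 * np.log(1.0 + Delta**2 / (beta * sigma_h**2)))  # (5.22)
mode = k_star * N / (2.0 * np.pi)            # lattice mode number for a ring of N clusters

sigma_star2 = (1.0 / beta + sigma_h**2 / Delta**2) \
    * (1.0 + Delta**2 / (beta * sigma_h**2)) ** (beta * sigma_h**2 / Delta**2)     # (5.23)

print(round(mode, 2), round(np.sqrt(sigma_star2), 2))   # ~2.75 (i.e. k* = 3) and ~1.27
```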


5.3 Numerical Results

In this section numerical results are presented to validate the analytical calculations and to illustrate the deterministic annealing scheme. First, STVQ is applied to a toy problem with a sufficiently simple transition matrix $\mathbf{H}$ for which the eigenvectors and eigenvalues can easily be calculated. Then, in order to demonstrate the effects and advantages of the deterministic annealing scheme for STVQ, a two-dimensional array of clusters in a two-dimensional data space is considered. Finally, the behavior of a one-dimensional "chain" of 128 clusters in a two-dimensional data space is investigated to validate the results for the automatic selection of feature dimensions of the previous section.

5.3.1 Toy Problem

Consider a two-dimensional data space with 2000 data points which were generated by an elongated Gaussian probability distribution $P(\mathbf{x}) = (2\pi)^{-1}|\mathbf{C}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\mathbf{x}^{\mathrm T}\mathbf{C}^{-1}\mathbf{x}\right)$ with diagonal covariance matrix $\mathbf{C} = \mathrm{diag}(1.0, 0.04)$. $N = 3$ cluster centers were coupled via a transition probability matrix $\mathbf{H}$,
$$ \mathbf{H} = \frac{1}{1+s}\begin{pmatrix} 1 & s & 0 \\ s & 1-s & s \\ 0 & s & 1 \end{pmatrix}. \qquad (5.28) $$
This choice of $\mathbf{H}$ corresponds to a "chain" of clusters where each cluster is linked to its nearest neighbor via the transition probability $s/(1+s)$, while second-nearest neighbors are uncoupled because the transition probabilities $h_{13} = h_{31}$ vanish. The magnitude of $s$ governs the coupling strength, and the normalization factor is included to comply with condition (4.4).

Figure 5.2 shows the x-coordinates of the positions $\mathbf{w}_r$ of the cluster centers in data space as functions of the inverse temperature parameter $\beta$ for the configuration of minimal free energy. At a critical inverse temperature $\beta^* = 1.21$ the cluster centers split along the x-axis, which is the principal axis of the distribution of data points. In accordance with the eigenvector $\mathbf{v}^G_{\max}$,
$$ \mathbf{v}^G_{\max} = (-1, 0, 1)^{\mathrm T}, \qquad (5.29) $$
for the largest eigenvalue $\lambda^G_{\max}$ of the matrix $\mathbf{G}$ given in (5.5), two cluster centers move to opposite positions along the principal axis while one remains at the center. Therefore, a topologically correct ordering is already established at the initial phase transition.

Figure 5.2: Plot of the projections $w^x_r$ of the cluster centers onto the first principal axis of the data as functions of $\beta$ for the toy problem with N = 3 cluster centers and nearest-neighbor coupling. 2000 data points are chosen randomly and independently from the Gaussian probability distribution given by $P(\mathbf{x}) = (2\pi)^{-1}|\mathbf{C}|^{-\frac{1}{2}}\exp(-\frac{1}{2}\mathbf{x}^{\mathrm T}\mathbf{C}^{-1}\mathbf{x})$ with diagonal covariance matrix $\mathbf{C} = \mathrm{diag}(1.0, 0.04)$. Cluster centers are initialized at the origin and STVQ is applied for different values of $\beta$. The STVQ convergence criterion was given by $\epsilon = 10^{-10}$. The analytically determined critical value of $\beta$ was $\beta^* = 1.21$ for a coupling strength of $s = 0.1$. It corresponds to the trifurcation point seen in the plot.

Figure 5.3 shows the critical value $\beta^*$ of the temperature parameter as a function of the nearest-neighbor coupling strength $s$. Error bars indicate the numerical results, which are in agreement with the theoretical prediction of (5.7) (solid line). The inset displays the average cost $U$,
$$ U = \frac{1}{2}\sum_i\sum_r P(i\in C_r)\sum_s h_{rs}\,\|\mathbf{x}_i-\mathbf{w}_s\|^2, \qquad (5.30) $$
as a function of $\beta$ for a coupling strength of $s = 0.1$. The visible drop of the average cost occurs at $\beta = 1.25$. Note that the transition zone is finite due to finite-size effects.


Figure 5.3: Plot of the critical value $\beta^*$ of the temperature parameter as a function of the coupling strength $s$ for the STVQ toy problem of Figure 5.2. Error bars denote the numerical results. For each value of $s$ the cluster centers are initialized at the origin, and $\beta$ is linearly annealed according to $\beta^{(t+1)} = \beta^{(t)} + 0.02$ with $\beta^{(0)} = 0.0$ and $\beta_{\mathrm{final}} = 6.0$ while monitoring $U$. For low values of $\beta$, the average cost $U$ is constant. The lower error margins denote the $\beta$ values for which the first change in $U$ occurs, and the upper error margins denote the $\beta$ values for which the large drop in $U$ occurs. The line shows the theoretical prediction calculated from (5.7) for $\lambda^C_{\max} = \sigma^2_x = 1.0$ and $\lambda^G_{\max} = 1/(1+s)^2$. Inset: plot of the average cost $U$ as a function of $\beta$ for a typical example ($s = 0.1$). The visible drop in $U$ occurs at $\beta = 1.25$.

5.3.2 Annealing of a Two-Dimensional Array of Cluster Centers

Let us now consider a two-dimensional data space and a set of $8\times 8$ clusters labeled by two-dimensional index vectors $\mathbf{r}$, $r_\nu\in\{1,2,\ldots,8\}$. The $D = 8\times 8$ data points lie equally spaced on a grid in the unit square. The transition probabilities $h_{\mathbf{rs}}$ are chosen from a Gaussian function of the distance between the index vectors $\mathbf{r}$ and $\mathbf{s}$,
$$ h_{\mathbf{rs}} = c\,\exp\left(-\frac{\|\mathbf{r}-\mathbf{s}\|^2}{2\sigma_h^2}\right), \qquad (5.31) $$
where the normalization constant $c$ is needed to satisfy (4.4). This set of transition probabilities corresponds to a "square grid" of clusters and is commonly used in applications of the SOM.


Figure 5.4 shows snapshots of a combined "heating" and "cooling" experiment which is best described in terms of the temperature $T := 1/\beta$. For the "heating" process, annealing starts at a low temperature $T = 0.0002$ with randomly initialized cluster centers, and the temperature is then increased according to an exponential scheme. Figures 5.4 a) - e) display a series of five snapshots of cluster centers during "heating". Defects of the grid, which indicate a local minimum of the cost function, are introduced by the random initialization of the cluster centers and are preserved at low temperatures. As $T$ is gradually increased, shallow local minima vanish and the grid becomes more and more ordered. Finally, a topologically ordered state is reached, which corresponds to the global minimum of the free energy. Because $T$ governs the resolution of the representation in data space, rather localized defects melt away at low temperature, which corresponds to a high resolution in data space, while global twists melt away last.

Figure 5.4: "Melting" of topological defects. The plots show snapshots of cluster centers for a two-dimensional $8\times 8$ cluster array and a two-dimensional data space using STVQ at different temperatures $T$ (a) T = 0.0002, b) T = 0.0013, c) T = 0.0099, d) T = 0.0617, e) T = 0.1004, f) T = 0.01). Dots indicate cluster centers, with those centers connected by lines which correspond to pairs of clusters for which the transition probability $h_{rs}$ is highest. Starting from a local minimum of the cost function introduced by random initialization and preserved at low temperature, as seen in Figure 5.4 a), the temperature $T$ is increased exponentially according to $T^{(t+1)} = 1.01\,T^{(t)}$. Figures 5.4 b) - e) illustrate the corresponding "melting" of topological defects. Figure 5.4 f) shows the positions of the cluster centers after "re-cooling" to $T = 0.01$. The Gaussian neighborhood function has standard deviation $\sigma_h = 0.5$ and the input data consist of 64 data points on a square grid in the unit square.

During "cooling" the temperature $T$ is decreased, starting from a very high value ($T = 0.1$) which corresponds to a state of the system where all cluster centers are merged at the center of mass of the data distribution. Annealing is performed according to the reverse "heating" schedule and terminates at $T = 0.0002$, which corresponds to the global minimum of the free energy and which is shown in Figure 5.4 f). Note that an ordered two-dimensional grid of cluster centers is established at the initial phase transition and remains in the ordered configuration throughout the "cooling" process. Figure 5.5 shows the average cost $U$, a measure for the quality of the data representation, as a function of the temperature $T$ for both annealing experiments from Figure 5.4, "heating" and "cooling". Figure 5.6 displays $C := dU(T)/dT$, the derivative of the average cost with respect to the temperature, as a function of $T$ for "heating".


Figure 5.5: Semi-logarithmic plot of the average assignment cost $U = \langle E\rangle$ as a function of temperature $T$ for the cluster array of Figure 5.4. The upper curve shows $U(T)$ for the exponential "heating" schedule from $T = 0.0002$ to $T = 0.1$, starting from the local minimum of the cost function shown in Figure 5.4 a). The steps in $U(T)$ occur at temperatures where "twists" in the spatial arrangements of cluster centers unfold. The lower curve shows $U(T)$ for the same scheme now applied in the "cooling" direction from $T = 0.1$ to $T = 0.0002$. During "cooling", the cluster centers remain in a "topologically ordered" arrangement (cf. Figures 5.4 e), f)). The normalization constant is $T_0 = 0.0002$; other parameters as in Figure 5.4.

$C$ is equivalent to the heat capacity in thermodynamics and can be interpreted as a measure of the progress made in the quality of the data representation per change in temperature during annealing (see [109] for a discussion of heat capacity as a general statistical measure). $C(T)$ exhibits pronounced peaks at temperatures which correspond to the "steps" in $U(T)$ during the annealing, at which rearrangements of the cluster centers occur. This behavior is analogous to that of physical systems that undergo phase transitions and reflects in our case a qualitative change in the assignment cost triggered by a small quantitative change in $T$. The heat capacity $C(T)$ may also serve to determine a reasonable annealing schedule in the temperature parameter, because it indicates critical points during the annealing.

5.3.3 Automatic Selection of Feature Dimensions for a Chain of Clusters

Consider a data set of 2000 data points drawn from a homogeneous probability distribution defined on a two-dimensional rectangular data space of length $\ell_x = 12.8$ and a variable width $\ell_y = 2\sqrt{3}\,\sigma_y$, where $\sigma_y^2$ is the variance of the probability distribution along the y-axis in data space.


Figure 5.6: Semi-logarithmic plot of the heat capacity $C(T) := dU(T)/dT$ as a function of temperature $T$ for the "heating" as shown in Figures 5.4 a) - e) and Figure 5.5 (upper curve). The temperatures corresponding to the peaked minima of the heat capacity indicate transition points of the array of cluster centers as observed in Figure 5.4. Parameters as given in Figures 5.4 and 5.5.

A set of $N = 128$ clusters is labeled by indices $r\in\{1,2,\ldots,N\}$. The transition probabilities $h_{rs}$ are chosen from a Gaussian function of the distance between indices $r$ and $s$,
$$ h_{rs} = c\,\exp\left(-\frac{\bigl(\min\left(\|r-s\|,\ N-\|r-s\|\right)\bigr)^2}{2\sigma_h^2}\right), \qquad (5.32) $$
where $c$ normalizes the probabilities according to (4.4). This set of transition probabilities corresponds to a linear chain of clusters. A one-dimensional chain in a two-dimensional data space constitutes the simplest non-trivial case for which (5.23) has been derived.

Since (5.23) has been derived for a longitudinal space of infinite size and in the continuum limit, periodic boundary conditions were imposed in the longitudinal x-dimension of data space and on the transition probabilities $h_{rs}$. The cluster centers were initialized according to (5.10) (see Figure 5.1 a)) with $\Delta = 10.0$. The size of the system to be examined was important in two respects. The number of clusters was chosen as large as computationally feasible in order to reduce finite-size effects on the mode spectrum as well as for the continuum approximation to be valid. The number of data points was chosen such that local inhomogeneities would not strongly bias the result whilst keeping the computation time still tractable.


Figure 5.1 b) shows the spatial distribution of cluster centers after the variance $\sigma_y^2$ has been gradually increased from $\sigma_y^2 = 0.0$ to $\sigma_y^2 = 3.24$, beyond the phase transition. The chain folds into the excess dimension y in a wave-like shape with a dominant wavelength $\lambda^*$. This is well illustrated in Figure 5.7, which depicts the power in each of the first five Fourier modes as a function of $\sigma_y$. At the critical value $\sigma^*_y = 1.27$ the critical mode $k^* = 3$ increases in power and finally dominates the spatial arrangement of the cluster centers.

Figure 5.7: Plot of the squared absolute amplitudes $\|\hat w^y_k\|^2$ of the transversal Fourier modes $k$ (k = 0, ..., 4) as functions of the standard deviation $\sigma_y$ of the data for the chain of N = 128 cluster centers shown in Figure 5.1. Only the five modes with the largest wavelength are shown. Beyond the phase transition at $\sigma^*_y = 1.27$ the k = 3 mode is selected and the chain folds into a sine-wave-like curve. Parameters are $\beta = 1.3$, $\Delta = 10.0$, and $\sigma_h = 5.0$. The 2000 data points are distributed uniformly in the data plane given by $[-6.4, 6.4]\times[-\ell_y/2, \ell_y/2]$, where $\ell_y = 2\sqrt{3}\,\sigma_y$ is the width of the data distribution in the y-direction.

Figure 5.8 shows the average cost $U = \langle E\rangle$ and its derivative w.r.t. $\sigma_y$ as functions of $\sigma_y$ for the numerical experiment shown in Figure 5.7. At the critical standard deviation $\sigma^*_y$ a kink occurs in the derivative. The position of this kink was used to obtain the numerical results of Figure 5.9, which compares the theoretical values for $\sigma^*_y$ (solid line) obtained from (5.23) with those obtained from the numerical simulations (error bars). The numerical results are in good agreement with the theoretical values obtained in the previous section, which, in hindsight, justifies the approximations employed in the derivation of (5.23).

Similar transitions in the data representation occur during annealing in $T$ for fixed $\sigma_y$ and $\sigma_h$. It can be observed from Figure 5.10, which shows the heat capacity $C(T)$ for such a case, that a stepwise decrease in $T$ leads to a smooth change of representation from the initial state (left inset) to a folded state (right inset) of the chain.


Figure 5.8: Plot of the average cost $U = \langle E\rangle$ and its derivative (scaled by an arbitrary constant) as functions of the standard deviation $\sigma_y$ of the data set in $y$-dimension for the chain of $N = 128$ cluster centers. The slope of the average cost shows a clear change at the critical value of $\sigma_y$. Interpolating between $\sigma_y$ at the minimum and $\sigma_y$ at the maximum of the derivative yields the critical value $\sigma_y^*$. The arrow indicates the theoretical prediction for the critical standard deviation $\sigma_y^* = 1.27$. Parameters as given in Figure 5.7.
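The kink-detection step described in this caption and used for Figure 5.9 can be sketched as follows. This is only an illustration under the assumption that the averaged cost has been recorded on a grid of $\sigma_y$ values; the array and function names are hypothetical.

import numpy as np

def critical_sigma(sigma_y, avg_cost):
    """Estimate sigma_y^* between the extrema of d<E>/dsigma_y (cf. Figures 5.8, 5.9)."""
    dE = np.gradient(avg_cost, sigma_y)   # numerical derivative of the average cost
    s_min = sigma_y[np.argmin(dE)]        # position of the minimum of the derivative
    s_max = sigma_y[np.argmax(dE)]        # position of the maximum of the derivative
    return 0.5 * (s_min + s_max)          # simple interpolation between the two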


Figure 5.9: Plot of the critical standard deviation $\sigma_y^*$ as a function of the temperature parameter $\beta$ for the chain of $N = 128$ cluster centers. The standard deviation $\sigma_y$ of the data set in the transversal $y$-dimension is linearly increased for fixed $\beta$ and the critical value $\sigma_y^*$ obtained from the derivative of the average cost as shown in Figure 5.8. The upper bound of the error bars is taken from the position of the minimum and the lower bound from the position of the maximum of $d\langle E\rangle/d\sigma_y$. Parameters as given in Figure 5.7.


Figure 5.10: Plot of the heat capacity $C(T) := dU(T)/dT$ as a function of the temperature $T$ for the chain of $N = 128$ cluster centers. Starting from the initial state of the chain at high temperature (left inset) the temperature $T$ is reduced in linear steps in $\beta = 1/T$, for fixed $\sigma_y$. As $T$ is lowered the heat capacity $C$ increases, the average cost $U$ is reduced faster, and the chain is continuously transformed into a folded configuration of the cluster centers (right inset). The vertical arrows indicate the corresponding temperatures, $T = 1.25$ and $T = 0.714$ for the left and the right inset, respectively. Parameters are given by $\sigma_y = 1.3$ and $\sigma_h = 5.0$.


Chapter 6

Kernel-Based Soft Topographic Mapping

6.1 Clustering in High Dimensional Feature Space

In this chapter a generalization of topographic clustering is presented which makes it possible to carry out the clustering effectively in a possibly very high dimensional feature space that is related to the input space by some nonlinear map. The idea was inspired by the recent revival of a method first proposed by Aizerman et al. [1], which is now used in the context of Support Vector machines [127, 22, 110] and nonlinear PCA [112]. The basic insight is that under certain conditions inner products $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$ in a high dimensional space $\mathcal{F}$, related to data space $\Re^d$ by a nonlinear mapping $\Phi: \Re^d \mapsto \mathcal{F}$, can be computed efficiently via a kernel function $k(\mathbf{x}_i,\mathbf{x}_j)$ in $\Re^d$. Conversely, given certain kernel functions $k(\mathbf{x}_i,\mathbf{x}_j)$ in data space $\Re^d$, it can be shown that these correspond to inner products $\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)$ in another space $\mathcal{F}$, related to data space by the mapping $\Phi$. As pointed out in [112], this trick can be applied whenever it is possible to express an algorithm solely in terms of inner products between data vectors. Replacing these inner products by the kernel functions $k(\cdot,\cdot)$ is then equivalent to carrying out the algorithm in the inner product space $\mathcal{F}$, which makes it possible to take into account, e.g., higher order correlations between input features. Thus the mapping $\Phi$ can be seen as a kind of preprocessing that brings out features of the data which would go unnoticed if the algorithm were run simply in data space. In the following, kernel-based methods are applied to soft topographic vector quantization, thus making a new class of distance measures available for topographic mappings. The idea is based on a suggestion of Sch&ouml;lkopf et al. [112] for k-means clustering [81]. Since the kernel trick is central to the results of this chapter, it will be introduced before the derivation of the algorithm.

6.2 The Kernel Trick and Mercer's Condition

6.2.1 The Kernel Trick

The kernel trick [127, 22, 110, 112] or potential function method [1] is an efficient way of calculating inner products $\Phi(\mathbf{x})\cdot\Phi(\mathbf{y})$ in a feature space $\mathcal{F}$, which is related to the data space $\Re^d$ by a mapping $\Phi$,
\[
\Phi: \Re^d \mapsto \mathcal{F}\,, \qquad (6.1)
\]
where $\mathcal{F}$ is some (possibly infinite dimensional) Hilbert space.

Page 52: pdfs.semanticscholar.org€¦ · Ac kno wledgmen ts It is m y pleasure to thank Prof. Klaus Ob erma y er for an excellen t sup ervision with n umerous discussions and fruitful suggestions

The idea is to express the inner product $(\cdot): \mathcal{F}\times\mathcal{F} \mapsto \Re$ in feature space in terms of a kernel function $k: \Re^d\times\Re^d \mapsto \Re$ in data space,
\[
\Phi(\mathbf{x})\cdot\Phi(\mathbf{y}) = k(\mathbf{x},\mathbf{y})\,. \qquad (6.2)
\]
Conversely, it is of interest whether a given kernel function $k(\mathbf{x},\mathbf{y})$ corresponds to an inner product $\Phi(\mathbf{x})\cdot\Phi(\mathbf{y})$ for a mapping $\Phi$ to some space $\mathcal{F}$. Before this question is answered, let us consider an example. Let $\mathbf{x},\mathbf{y} \in \Re^2$ and $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})^2$. Then it is easily seen that for
\[
\Phi(\mathbf{x}) = \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{pmatrix} \qquad (6.3)
\]
the relation (6.2) holds. This mapping is illustrated in Figure 6.1, which shows that the data vectors $\Phi(\mathbf{x})$ in $\mathcal{F}$ only occupy a submanifold of dimension $d$ at maximum. Note that (6.3) is not the only mapping for which relation (6.2) holds. The corresponding space is, however, the minimal embedding space, i.e. the one with minimal dimensionality.
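As a quick numerical illustration of this example (a sketch added here, not part of the original derivation), the identity (6.2) for the degree-two polynomial kernel can be checked directly; the function name phi and the chosen test vectors are arbitrary.

import numpy as np

def phi(x):
    """Explicit feature map (6.3) for the degree-2 polynomial kernel in R^2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

x, y = np.array([0.3, -1.2]), np.array([0.7, 0.5])
k_xy = np.dot(x, y)**2               # kernel evaluated in data space
phi_xy = np.dot(phi(x), phi(y))      # inner product in feature space F
assert np.isclose(k_xy, phi_xy)      # relation (6.2) holds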


Figure 6.1: Illustration of a mapping from data space (left) to a higher dimensional feature space (right). Calculating the kernel function in $\Re^2$ is equivalent to calculating the inner product in $\mathcal{F}$. The two connected points on the right are the images of those on the left. Clearly, the distance between points changes under the mapping $\Phi$, which in this example is given by $\Phi: \Re^2 \mapsto \mathcal{F} \subset \Re^3$, $\Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)^T$. Shown is the mapping $\Phi$ of the square $[-1,1]\times[-1,1] \subset \Re^2$.

6.2.2 Mercer's Condition

In general, the question for which kernel functions there exists a pair $\{\mathcal{F},\Phi\}$ with the above described properties is answered by Mercer's condition [23]:

Page 53: pdfs.semanticscholar.org€¦ · Ac kno wledgmen ts It is m y pleasure to thank Prof. Klaus Ob erma y er for an excellen t sup ervision with n umerous discussions and fruitful suggestions

There exists a mapping $\Phi$ and an expansion (inner product)
\[
k(\mathbf{x},\mathbf{y}) = \sum_i \Phi_i(\mathbf{x})\,\Phi_i(\mathbf{y}) = \Phi(\mathbf{x})\cdot\Phi(\mathbf{y}) \qquad (6.4)
\]
if and only if the kernel $k(\cdot,\cdot)$ is positive semidefinite, i.e., it fulfills
\[
\int g(\mathbf{x})\,k(\mathbf{x},\mathbf{y})\,g(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} \geq 0 \qquad (6.5)
\]
for any square integrable function $g(\mathbf{x})$,
\[
\int g^2(\mathbf{x})\,d\mathbf{x} < \infty\,. \qquad (6.6)
\]
Although it may not be easy in general to check the positive semidefiniteness of a given kernel function, quite a few useful kernel functions do satisfy condition (6.5), some of which will be discussed in the following.

6.2.3 Admissible Kernel Functions

Typical kernels that have been used so far include [112, 127]

1. the polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})^p$,
2. the sigmoidal kernel $k(\mathbf{x},\mathbf{y}) = \tanh(\kappa\,\mathbf{x}\cdot\mathbf{y} + \theta)$,
3. the radial basis function kernel $k(\mathbf{x},\mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/2\sigma^2)$.

Let us first discuss the polynomial kernel, which corresponds to a finite dimensional feature space $\mathcal{F}$. The polynomial kernel of degree $p$ acting on data in $\Re^d$ is given by
\[
k(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})^p\,, \qquad \mathbf{x},\mathbf{y} \in \Re^d\,, \qquad (6.7)
\]
and the previous example corresponded to the case $p = 2$, $d = 2$. Introducing $z_\nu = x_\nu y_\nu$ one can see that each dimension of the feature space $\mathcal{F}$ is related to a term with given powers of the $z_\nu$ in the expansion of $k(\mathbf{x},\mathbf{y})$. Denoting the dimensions of $\Phi$ by the powers $r_\nu$ in which the $z_\nu$ appear, the mapping corresponding to (6.7) can be expressed as follows [19]:
\[
\Phi_{r_1 r_2 \cdots r_d} = \sqrt{\frac{p!}{r_1!\,r_2!\cdots r_d!}}\; x_1^{r_1} x_2^{r_2} \cdots x_d^{r_d}\,, \qquad \sum_{\nu=1}^d r_\nu = p\,,\quad r_\nu \in \mathbb{N}\,. \qquad (6.8)
\]
This means that the space $\mathcal{F}$ corresponding to (6.7) is spanned by all monomials of degree $p$. Also, it is the minimal embedding space and has dimensionality $\binom{p+d-1}{p}$. This result hints at the combinatorial explosion to be encountered when trying to operate in $\mathcal{F}$ directly.

The sigmoidal kernel has been introduced to solve classification tasks using the kernel trick, and it bears resemblance to the sigmoidal activation function used in artificial neural networks. It can be shown, however, that
\[
k(\mathbf{x},\mathbf{y}) = \tanh(\kappa\,(\mathbf{x}\cdot\mathbf{y}) + \theta) \qquad (6.9)
\]
does not satisfy Mercer's condition for all parameter values $\kappa$ and $\theta$.

The Gaussian radial basis function kernel
\[
k(\mathbf{x},\mathbf{y}) = \exp\left(-\|\mathbf{x}-\mathbf{y}\|^2/2\sigma^2\right) \qquad (6.10)
\]
has also been used successfully in supervised learning tasks with the Support Vector Machine. It corresponds to an infinite dimensional feature space $\mathcal{F}$, because its expansion in polynomials has an infinite number of terms. For this type of kernel it is, consequently, impossible to perform the algorithm in $\mathcal{F}$ directly.
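On a finite data sample, the Gram matrix of a Mercer kernel must itself be positive semidefinite, which gives a simple empirical check. The following sketch (an illustration only; the parameter values and random sample are arbitrary) implements the three kernels above and inspects the eigenvalues of their Gram matrices.

import numpy as np

def gram(kernel, X):
    """Gram matrix K_ij = k(x_i, x_j) for a data sample X (rows are points)."""
    return np.array([[kernel(x, y) for y in X] for x in X])

poly = lambda x, y, p=2: np.dot(x, y)**p
sigm = lambda x, y, kappa=1.0, theta=-1.0: np.tanh(kappa * np.dot(x, y) + theta)
rbf  = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y)**2) / (2.0 * sigma**2))

X = np.random.randn(50, 3)
for name, k in [("polynomial", poly), ("sigmoidal", sigm), ("RBF", rbf)]:
    lam = np.linalg.eigvalsh(gram(k, X))
    print(name, "smallest eigenvalue:", lam.min())
# the polynomial and RBF Gram matrices are positive semidefinite (up to round-off);
# the sigmoidal kernel can produce negative eigenvalues for some kappa, theta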


6.3 Derivation of Kernel-Based Soft Topographic Mapping

6.3.1 Topographic Clustering in Feature Space

Let $\Phi: \Re^d \mapsto \mathcal{F}$ be a mapping from data space $\Re^d$ to a possibly very high dimensional feature space $\mathcal{F}$. The cost function for topographic vector quantization (see (4.1) in Chapter 4) in $\mathcal{F}$ is then given by
\[
E_{\mathrm{TMK}}(\mathbf{M},\tilde{\mathbf{w}}) = \frac{1}{2}\sum_{i=1}^{D}\sum_{r=1}^{N} m_{ir}\sum_{s=1}^{N} h_{rs}\,\|\Phi(\mathbf{x}_i) - \tilde{\mathbf{w}}_s\|^2 \qquad (6.11)
\]
under the constraints
\[
\sum_{r=1}^{N} m_{ir} = 1\,,\;\forall i\,, \qquad \text{and} \qquad \sum_{s=1}^{N} h_{rs} = 1\,,\;\forall r\,, \qquad (6.12)
\]
where $\Phi(\mathbf{x}_i) \in \mathcal{F}$ are the images of the input vectors $\mathbf{x}_i \in \Re^d$, $i \in \{1,\ldots,D\}$, $N$ is the number of clusters, $\tilde{\mathbf{w}}_r \in \mathcal{F}$ are the cluster centers, $m_{ir}$ are binary assignment variables, and $h_{rs}$ are transition probabilities. Note that (6.11) is a generalization of the cost function for TVQ (4.1), which is recovered if $\Phi$ is taken to be the identity mapping.

Following the same derivation as in Chapter 4 one obtains stationarity conditions for the cluster centers $\tilde{\mathbf{w}}_r$,
\[
\tilde{\mathbf{w}}_r = \frac{\sum_{i=1}^{D}\Phi(\mathbf{x}_i)\sum_{s=1}^{N} h_{rs}\langle m_{is}\rangle}{\sum_{i=1}^{D}\sum_{s=1}^{N} h_{rs}\langle m_{is}\rangle}\,, \quad \forall r\,, \qquad (6.13)
\]
where $\langle m_{ir}\rangle$ is the average of the binary assignment variable $m_{ir}$ and constitutes the assignment probability $P(i \in C_r)$ of data point $i$ to cluster $r$, given by
\[
\langle m_{ir}\rangle = \frac{\exp(-\beta e_{ir})}{\sum_{t=1}^{N}\exp(-\beta e_{it})}\,, \quad \forall i, r\,. \qquad (6.14)
\]
The partial assignment costs $e_{ir}$ are
\[
e_{ir} = \frac{1}{2}\sum_{s=1}^{N} h_{rs}\,\|\Phi(\mathbf{x}_i) - \tilde{\mathbf{w}}_s\|^2\,. \qquad (6.15)
\]
In principle, this would be sufficient to formulate an EM algorithm with (6.14) as the expectation step (E step) and (6.13) as the maximization step (M step), as done in Chapter 4 for the STVQ algorithm. If, however, the feature space $\mathcal{F}$ is of very high dimensionality, the procedure of explicitly mapping the data vectors to $\mathcal{F}$ and performing the clustering in $\mathcal{F}$ would be computationally intractable. In order to solve this problem, the kernel trick as introduced in Section 6.2 will be applied.

6.3.2 Application of the Kernel Trick

To avoid the explicit mapping $\mathbf{x}_i \mapsto \Phi(\mathbf{x}_i)$ to $\mathcal{F}$ and the calculation of the squared Euclidean distance in $\mathcal{F}$, the cluster centers $\tilde{\mathbf{w}}_r$ are expressed as linear combinations of the transformed input vectors $\Phi(\mathbf{x}_i)$,
\[
\tilde{\mathbf{w}}_r = \sum_{i=1}^{D} a_{ir}\,\Phi(\mathbf{x}_i)\,, \quad \forall r\,. \qquad (6.16)
\]


This restricts the $\tilde{\mathbf{w}}_r$ to the subspace of $\mathcal{F}$ spanned by the $\Phi(\mathbf{x}_i)$, but this restriction holds for every minimum of (6.11), because cluster centers outside this subspace would in any case increase the total cost. Comparing (6.16) with (6.13) one obtains for the coefficients $a_{ir}$,
\[
a_{ir} = \frac{\sum_{s=1}^{N} h_{rs}\langle m_{is}\rangle}{\sum_{j=1}^{D}\sum_{s=1}^{N} h_{rs}\langle m_{js}\rangle}\,, \quad \forall i, r\,. \qquad (6.17)
\]
Now the partial assignment costs can be written solely in terms of inner products of data vectors in $\mathcal{F}$ and thus in terms of kernel functions in data space. One obtains
\[
\begin{aligned}
e_{ir} &= \frac{1}{2}\sum_{s=1}^{N} h_{rs}\,\|\Phi(\mathbf{x}_i)-\tilde{\mathbf{w}}_s\|^2 \\
&= \frac{1}{2}\sum_{s=1}^{N} h_{rs}\Bigl[\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_i) - 2\,\Phi(\mathbf{x}_i)\cdot\sum_{j=1}^{D} a_{js}\Phi(\mathbf{x}_j) + \sum_{j,k=1}^{D}\Phi(\mathbf{x}_j)\cdot\Phi(\mathbf{x}_k)\,a_{js}a_{ks}\Bigr] \\
&= \frac{1}{2}\sum_{s=1}^{N} h_{rs}\Bigl[k(\mathbf{x}_i,\mathbf{x}_i) - 2\sum_{j=1}^{D} a_{js}\,k(\mathbf{x}_i,\mathbf{x}_j) + \sum_{j,k=1}^{D} k(\mathbf{x}_j,\mathbf{x}_k)\,a_{js}a_{ks}\Bigr]\,, \qquad (6.18)
\end{aligned}
\]
using the kernel trick, $\Phi(\mathbf{x})\cdot\Phi(\mathbf{y}) = k(\mathbf{x},\mathbf{y})$.

6.3.3 EM and Deterministic Annealing

The formulation just presented leads to an efficient EM scheme for the computation of the assignment probabilities $\langle m_{ir}\rangle$ at a given value of $\beta$. In the E step (6.14) the assignment probabilities $\langle m_{ir}\rangle$ are estimated based on the previous estimate of the partial assignment costs $e_{ir}$. In the M step (6.18) the partial assignment costs $e_{ir}$ are recalculated in terms of the coefficients $a_{ir}$, which are obtained from the previous estimate of the assignment probabilities $\langle m_{ir}\rangle$ in (6.17).

In order to avoid convergence to local minima, deterministic annealing in the parameter $\beta$ is employed as discussed in Chapter 3, and thus one obtains the kernel-based soft topographic mapping (STMK). The following diagram gives an overview of the STMK algorithm in pseudo code:

Kernel-Based Soft Topographic Mapping (STMK)

  initialize
    e_ir <- n_ir, for all i, r, with n_ir in [0,1] a random number
    calculate lookup tables for h_rs and k(x_i, x_j)
    choose beta_start, beta_final, eta, and epsilon
    beta <- beta_start
  while beta < beta_final                         (Annealing)
    repeat                                        (EM)
      E step: calculate <m_ir>, for all i, r, using Eq. (6.14)
      M step: calculate a_ir^new, for all i, r, using Eq. (6.17)
              calculate e_ir^new, for all i, r, using Eq. (6.18)
    until |e_ir^new - e_ir^old| < epsilon, for all i, r
    beta <- eta * beta
  end
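A compact NumPy sketch of one EM iteration of this scheme is given below. It is only an illustration of equations (6.14), (6.17), and (6.18), assuming a precomputed kernel matrix K and coupling matrix H; the variable and function names are chosen freely and do not come from the original implementation.

import numpy as np

def stmk_em_step(K, H, E, beta):
    """One EM iteration of STMK on a precomputed kernel matrix K (D x D).

    K[i, j] = k(x_i, x_j), H[r, s] = h_rs (rows sum to one),
    E[i, r] = partial assignment costs e_ir."""
    # E step, Eq. (6.14): soft assignments from the partial costs
    logits = -beta * E
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    M = np.exp(logits)
    M /= M.sum(axis=1, keepdims=True)                 # <m_ir>
    # M step, Eq. (6.17): expansion coefficients a_ir
    A = M @ H.T                                       # sum_s h_rs <m_is>
    A /= A.sum(axis=0, keepdims=True)
    # Eq. (6.18): new partial costs, written with kernel evaluations only
    dist = (np.diag(K)[:, None]                       # k(x_i, x_i)
            - 2.0 * K @ A                             # -2 sum_j a_js k(x_i, x_j)
            + np.sum(A * (K @ A), axis=0)[None, :])   # sum_jk a_js a_ks k(x_j, x_k)
    E_new = 0.5 * dist @ H.T                          # convolve with h_rs
    return M, A, E_new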


Note that the SOM approximation as introduced for STVQ in Chapter 4 can also be applied to STMK. One can leave out the convolution with $h_{rs}$ in (6.18) and save computation time at the cost of losing the guarantee that the EM algorithm converges to a local minimum of the free energy.

6.3.4 Critical Temperature of the First Phase Transition

As has been discussed in Chapter 5 for soft topographic vector quantization, the annealing process induces phase transitions in the cluster representation in that the cluster centers split up in order to represent the data space. These splits are related to qualitative changes in the optimization problem and have to be taken into account in the annealing process. In order to avoid wasting computation time one should start the annealing at a value $\beta_{\mathrm{start}} \lesssim \beta^*$, where $\beta^*$ is the value at which the first split in the representation occurs. Essentially following the derivation in Chapter 5 for the initial phase transition, the critical parameter value $\beta^*$ for these phase transitions can be expressed in terms of the largest eigenvalues of matrices $\mathbf{G}$ and $\mathbf{C}$,
\[
\beta^* = \frac{1}{\lambda_{\max}^{G}\,\lambda_{\max}^{C}}\,, \qquad (6.19)
\]
where the elements of $\mathbf{G}$ are given by $g_{rs} = \sum_{t=1}^{N} h_{rt}\left[h_{ts} - \frac{1}{N}\right]$, which is the same expression obtained for STVQ in Chapter 5. $\mathbf{C}$ is the covariance matrix, $\mathbf{C} = \frac{1}{D}\sum_{i=1}^{D}\tilde{\Phi}(\mathbf{x}_i)\tilde{\Phi}^T(\mathbf{x}_i)$, of the centered images $\tilde{\Phi}(\mathbf{x}_i) = \Phi(\mathbf{x}_i) - \frac{1}{D}\sum_{j=1}^{D}\Phi(\mathbf{x}_j)$ of the data points. From singular value decomposition it can be seen that the matrix of inner products $\tilde{\mathbf{K}}$, $\tilde{k}_{ij} = \frac{1}{D}\tilde{\Phi}(\mathbf{x}_i)\cdot\tilde{\Phi}(\mathbf{x}_j)$, has the same non-zero eigenvalues as $\mathbf{C}$. Thus $\lambda_{\max}^{C} = \lambda_{\max}^{\tilde{K}}$, where $\tilde{\mathbf{K}}$ with elements $\tilde{k}_{ij}$ can be calculated using the kernel table $k_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)$ [112],
\[
\tilde{k}_{ij} = \frac{1}{D}\left[k_{ij} - \frac{1}{D}\sum_{l=1}^{D} k_{lj} - \frac{1}{D}\sum_{m=1}^{D} k_{im} + \frac{1}{D^2}\sum_{l,m=1}^{D} k_{lm}\right]\,. \qquad (6.20)
\]

6.4 Numerical Simulations using the RBF Kernel

6.4.1 Effect of the RBF Kernel

The numerical simulations focus on the RBF kernel,
\[
k(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}\right)\,. \qquad (6.21)
\]
The effect of its width $\sigma$ on the outcome of the topographic clustering is examined. The use of a kernel function instead of the regular inner product in data space can be thought of as introducing a new distance measure in data space. In order to get a better understanding of the new distance measure, let us introduce the RBF kernel into the squared Euclidean distance (6.18) which determines the partial assignment costs. For the case $\tilde{\mathbf{w}}_s = \Phi(\mathbf{x}_j)$, i.e., $a_{is} = 1$ for $i = j$ and $a_{is} = 0$ otherwise in (6.18), this gives
\[
\|\Phi(\mathbf{x}_i) - \tilde{\mathbf{w}}_s\|^2 = 2 - 2\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{x}_j\|^2}{2\sigma^2}\right)\,. \qquad (6.22)
\]
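The kernel-induced distance (6.22) is easily tabulated. The following small sketch (an illustration only, using the three widths studied below) computes it over a range of data-space distances and shows the saturation at the value 2 for $\|\mathbf{x}_i-\mathbf{x}_j\| \gg \sigma$.

import numpy as np

def rbf_feature_distance(d, sigma):
    """Squared feature-space distance (6.22) induced by the RBF kernel (6.21),
    as a function of the data-space distance d = ||x_i - x_j||."""
    return 2.0 - 2.0 * np.exp(-d**2 / (2.0 * sigma**2))

d = np.linspace(0.0, 6.0, 7)
for sigma in (0.1, 1.0, 10.0):
    print(sigma, np.round(rbf_feature_distance(d, sigma), 3))
# small sigma: the distance saturates at 2 almost immediately;
# large sigma: even distant points still influence the assignment costs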


The r.h.s. of (6.22) is plotted in Fig. 6.2 as a function of the distance $\|\mathbf{x}_i-\mathbf{x}_j\|$ for three different values of $\sigma$. It can be seen that for small values of $\sigma$ only very close vectors influence the assignment costs, while for vectors further away the assignment costs are almost constant w.r.t. distance. For high values of $\sigma$, there is a significant influence also for vectors at a greater distance.

0

0.5

1

1.5

2

2.5

3

|Φ(x

i) −

ws|2

|xi − x

j|

σ = 0.1σ = 1 σ = 10

Figure 6.2: Plot of the r.h.s. of (6.22) as a function of distance $\|\mathbf{x}_i-\mathbf{x}_j\|$ in data space for three different values of $\sigma$. The value of $\sigma$ determines the range relevant to the assignment costs of a cluster. At distances $\|\mathbf{x}_i-\mathbf{x}_j\| \gg \sigma$ the assignment costs are constant.

6.4.2 Simulation Results with Handwritten Digit Data

In the following, the effect of the parameter $\sigma$ on the generation of topographic maps of handwritten digit data from the United States Postal Service (USPS) in Buffalo is studied. The data are given as $16\times 16$ pixel gray value images corresponding to 256-dimensional vectors taking values in $[-1,1]^{256}$. The neighborhood function $h_{\mathbf{rs}}$ was chosen such that it reflects the topology of a $5\times 5$ lattice of $N = 25$ clusters, with the coupling strength decreasing as a Gaussian function of distance,
\[
h_{\mathbf{rs}} = c\,\exp\left(-\|\mathbf{r}-\mathbf{s}\|^2/2\sigma_h^2\right)\,, \qquad (6.23)
\]
and $\sigma_h = 0.5$. The transition probabilities $h_{\mathbf{rs}}$ are normalized to unit probability over all clusters by $c$.
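One possible construction of this lattice coupling matrix is sketched below. It is illustrative only: cluster indices are taken as two-dimensional lattice coordinates, the normalization follows the row constraint used throughout, and the function name is hypothetical.

import numpy as np

def lattice_couplings(side=5, sigma_h=0.5):
    """Coupling matrix h_rs for a side x side lattice of clusters, cf. (6.23)."""
    # two-dimensional index vectors r = (row, col) of the N = side**2 clusters
    idx = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)
    sq_dist = np.sum((idx[:, None, :] - idx[None, :, :])**2, axis=-1)
    h = np.exp(-sq_dist / (2.0 * sigma_h**2))
    return h / h.sum(axis=1, keepdims=True)   # normalize, sum_s h_rs = 1

H = lattice_couplings()
print(H.shape)   # (25, 25)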


Four topographic maps were generated, each time using the same $D = 500$ vectors (smoothed with a Gaussian of width 0.75) from the USPS data as input. The STMK algorithm described in the previous section was applied, choosing an exponential annealing schedule [42] with $\eta = 1.5$ and $\beta_{\mathrm{start}} < \beta^*$ from (6.19). The convergence criterion for the EM algorithm was given by $\varepsilon = 10^{-6}$. The resulting maps are shown in Figure 6.3. Since the cluster centers $\tilde{\mathbf{w}}_r$ in the high dimensional space $\mathcal{F}$ can hardly be visualized, the plot shows for each cluster center its "projection" $\tilde{\mathbf{w}}_r = \sum_{i=1}^{D} a_{ir}\mathbf{x}_i$. The upper left map was generated with the polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y})/256$, corresponding to the standard result of soft topographic vector quantization [42]. The other three maps were generated using the radial basis function kernel $k(\mathbf{x},\mathbf{y}) = \exp(-\|\mathbf{x}-\mathbf{y}\|^2/256\,\sigma^2)$ with $\sigma = \{0.1, 1.0, 10\}$. In both cases the inner products in the kernel functions were scaled with the dimensionality of the input space to make the parameters independent of it. Clearly, the maps in Figure 6.3 show topography, i.e. similar data vectors or digits with large overlaps are mapped to nearby clusters in the lattice. In the map with $\sigma = 0.1$ the cluster centers tend to occupy regions in data space with a high density of input points, so as to minimize the number of data points in the large area for which the assignment cost is constantly maximal. The digit "1", which is represented in the top right corner of this map, is such a region of high data density, because all digits "1" tend to look alike. The opposite is true for the case $\sigma = 10$. Data vectors at greater distance influence the cost of a cluster and thus the position of its cluster center. As a consequence the cluster centers are positioned in such a way as to take into account the data points in the full $\sigma$-range. Hence, the parameter $\sigma$ makes it possible to fine-tune the range relevant to the determination of the cluster centers.

6.4.3 Conclusion on STMK

The STMK algorithm that was introduced in this chapter is able to perform topographic clustering in high dimensional feature spaces at minimal computational extra cost as compared to STVQ. From a different point of view, STMK is an efficient optimization scheme for clustering in the Euclidean data space with cost functions involving distance measures other than the squared Euclidean distance. Both points of view can be beneficial depending on the applications involved.

There are many possible applications of STMK. One possibility is document clustering, where the documents are represented by binary vectors whose non-zero entries are related to occurrences of words from a key word list. The polynomial kernel would then incorporate products of word occurrences, which for binary values correspond to logical ANDs. Thus the higher order structure of the documents is made accessible to the clustering algorithm. Another possible application is unsupervised texture segmentation. Textures are characterized by higher order statistics of the pixel representation, and a suitable kernel could thus implement a distance measure which relates different textures in a reasonable way.

In summary, it remains to be seen which kernels are suitable for which applications. It is clear, however, that new flexibility is gained by the wide choice of kernels and corresponding distance measures.


Panel titles: polynomial kernel ($d = 1$); Gaussian kernel ($\sigma = 0.1$); Gaussian kernel ($\sigma = 1.0$); Gaussian kernel ($\sigma = 10$).

Figure 6.3: Topographic maps of handwritten digit data using Gaussian RBF kernels of different width. $N = 25$ clusters were coupled according to (6.23) with $\sigma_h = 0.5$. STMK was applied to the $D = 500$ data vectors of dimension $d = 256$ consisting of $16\times 16$ gray value images with values in $[-1,1]$. The optimization parameters were $\eta = 1.5$ and $\varepsilon = 10^{-6}$. Shown are the projections $\tilde{\mathbf{w}}_r = \sum_{i=1}^{D} a_{ir}\mathbf{x}_i$ of the cluster centers arranged in the lattice induced by $\mathbf{H}$.


Chapter 7

Soft Topographic Mapping for Proximity Data

7.1 Topographic Clustering on Pairwise Dissimilarity Data

In the field of unsupervised learning researchers have focussed on analysis methods for data which are given as vectors in a space that is assumed to be Euclidean. Examples of this kind include principal component analysis (PCA) [63, 121], independent component analysis (ICA) [6], vector quantization (VQ) [81], latent variable models [10], and Self-Organizing Maps (SOM) [70, 101]. The algorithms presented so far in this work, STVQ in Chapter 4 and STMK in Chapter 6, also belong to this group. Often, however, data items are not given as points in a Euclidean data space, and one has to restrict oneself to the set of pairwise proximities as measured in particular in empirical sciences like psychology, biochemistry, linguistics, or economics. Here, two strategies for data analysis have been pursued for some time: pairwise clustering, which detects cluster structure in dissimilarity data [53, 26], and multidimensional scaling, which deals with the embedding of pairwise proximity data in a Euclidean space for the purpose of visualization [12]. Recently, both approaches were combined by Hofmann and Buhmann [54]. As an alternative, this chapter presents a generalization of STVQ to pairwise proximity data. The resulting algorithm, soft topographic mapping for proximity data (STMP), creates topographic maps of pairwise proximity data. The approach is based on a mean-field approximation that makes it possible to calculate approximate averages w.r.t. the Gibbs distribution that results from the application of the principle of maximum entropy to the cost function (2.10) derived in Chapter 2.

7.2 Derivation of Soft Topographic Mapping for Proximity Data

7.2.1 Mean-Field Approximation

Consider the cost function $E_{\mathrm{TMP}}$ derived in Chapter 2 for the topographic mapping of proximity data,
\[
E_{\mathrm{TMP}}(\mathbf{M}) = \frac{1}{2}\sum_{i,j=1}^{D}\sum_{r,s,t=1}^{N}\frac{m_{ir} h_{rs}\, m_{jt} h_{ts}}{\sum_{k=1}^{D}\sum_{u=1}^{N} m_{ku} h_{us}}\, d_{ij}\,, \qquad (7.1)
\]
under the constraints
\[
\sum_{r=1}^{N} m_{ir} = 1\,,\;\forall i\,, \qquad \text{and} \qquad \sum_{s=1}^{N} h_{rs} = 1\,,\;\forall r\,. \qquad (7.2)
\]
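As a concrete reading of (7.1), the sketch below evaluates the cost for a given assignment matrix. It is only an unoptimized illustration; the array names M, H, and D_mat are hypothetical and not taken from the original implementation.

import numpy as np

def tmp_cost(M, H, D_mat):
    """Pairwise-data cost (7.1): M[i, r] assignments, H[r, s] couplings,
    D_mat[i, j] dissimilarities d_ij."""
    P = M @ H                              # P[i, s] = sum_r m_ir h_rs
    norm = P.sum(axis=0)                   # sum_k sum_u m_ku h_us, one value per s
    cost = 0.0
    for s in range(H.shape[0]):
        w = P[:, s] / norm[s]              # relative weights of data points in "cell" s
        cost += 0.5 * norm[s] * (w @ D_mat @ w)
    return cost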


In order to apply the deterministic annealing scheme introduced in Chapter 3, the principle of maximum entropy [62] is again applied, which yields a Gibbs distribution
\[
P(\mathbf{M}) = \frac{1}{Z_P}\exp\left(-\beta E_{\mathrm{TMP}}(\mathbf{M})\right)\,, \qquad (7.3)
\]
where $\beta$ is the inverse temperature and $Z_P$ the partition function, given by
\[
Z_P = \sum_{\{\mathbf{M}\}}\exp\left(-\beta E_{\mathrm{TMP}}(\mathbf{M})\right)\,. \qquad (7.4)
\]
The summation in the partition function is over all "legal" assignment matrices $\{\mathbf{M}\}$. Since the cost function $E_{\mathrm{TMP}}(\mathbf{M})$ is not linear in the assignment variables $m_{ir}$, this probability distribution does not factorize and, as a consequence, it is difficult to calculate averages w.r.t. it. Following Saul and Jordan [107], a cost function linear in the assignment variables,
\[
E^0(\mathbf{M},\mathcal{E}) = \sum_{i=1}^{D}\sum_{r=1}^{N} m_{ir}\, e_{ir}\,, \qquad (7.5)
\]
parameterized by partial assignment costs $\mathcal{E} = (e_{ir})_{i=1,\ldots,D;\,r=1,\ldots,N} \in \Re^{D\times N}$, leads to a probability distribution $Q(\mathbf{M},\mathcal{E})$,
\[
Q(\mathbf{M},\mathcal{E}) = \frac{1}{Z_Q}\exp\left(-\beta E^0(\mathbf{M},\mathcal{E})\right)\,, \qquad (7.6)
\]
which factorizes. The partial assignment costs $\mathcal{E}$ are determined such as to minimize the Kullback-Leibler (KL) divergence
\[
\mathrm{KL}(Q\,|\,P) = \sum_{\{\mathbf{M}\}} Q(\mathbf{M},\mathcal{E})\,\ln\frac{Q(\mathbf{M},\mathcal{E})}{P(\mathbf{M})}\,. \qquad (7.7)
\]
The Kullback-Leibler divergence can also be expressed in terms of the free energies corresponding to the original cost function (7.1) and its approximation (7.5), $F_{\mathrm{TMP}}$ and $F^0$, respectively,
\[
\begin{aligned}
\mathrm{KL}(Q\,|\,P) &= \sum_{\{\mathbf{M}\}} Q(\mathbf{M},\mathcal{E})\,\log\frac{\exp(-\beta E^0(\mathbf{M}))\sum_{\{\mathbf{M}'\}}\exp(-\beta E_{\mathrm{TMP}}(\mathbf{M}'))}{\exp(-\beta E_{\mathrm{TMP}}(\mathbf{M}))\sum_{\{\mathbf{M}'\}}\exp(-\beta E^0(\mathbf{M}'))} \\
&= \beta\left(F^0 - F_{\mathrm{TMP}} + \langle E_{\mathrm{TMP}} - E^0\rangle\right)\,, \qquad (7.8)
\end{aligned}
\]
where the average is taken w.r.t. $Q(\mathbf{M},\mathcal{E})$. Since $\mathrm{KL}(Q\,|\,P) \geq 0$ for all $Q, P$, where equality holds for $Q = P$ only, the well known upper bound on the free energy first derived by Peierls [96] is recovered,
\[
F_{\mathrm{TMP}} \leq F^0 + \langle E_{\mathrm{TMP}} - E^0\rangle\,. \qquad (7.9)
\]
Note that this approach implicitly assumes that assignments of data items to clusters are independent in the sense that $\langle m_{ir} m_{jr}\rangle = \langle m_{ir}\rangle\langle m_{jr}\rangle$, an assumption that is most likely to be valid in the case $D \gg N$.


As is known from Hofmann et al. [58], more elaborate mean-field approaches like the TAP method are conceivable, but contribute little to the development of an applicable algorithm. Minimizing the upper bound (7.9) yields the conditions
\[
\frac{\partial}{\partial e_{kv}}\left(F^0 + \langle E_{\mathrm{TMP}} - E^0\rangle\right) = 0\,, \quad \forall k, v\,, \qquad (7.10)
\]
from which one obtains
\[
\frac{\partial\langle E_{\mathrm{TMP}}(\mathbf{M})\rangle}{\partial e_{kv}} - \sum_{r=1}^{N}\frac{\partial\langle m_{kr}\rangle}{\partial e_{kv}}\, e_{kr} = 0\,, \quad \forall k, v\,. \qquad (7.11)
\]
From these conditions the optimal mean-fields $e_{kr}$ can be calculated as detailed in Appendix 9.4. Note that the cost function (7.1) is invariant under the substitution $d_{ij} = d_{ji} \rightarrow (d_{ij}+d_{ji})/2$. Making the simplifying assumption of zero self-dissimilarity, $d_{ii} = 0$, and neglecting terms of order $O(1/D)$, one obtains for the optimal mean-fields $e_{kr}$,
\[
e_{kr} = \sum_{s=1}^{N} h_{rs}\sum_{j=1}^{D} b_{js}\left(d_{kj} - \frac{1}{2}\sum_{i=1}^{D} b_{is}\, d_{ij}\right)\,, \qquad (7.12)
\]
with weighting coefficients
\[
b_{js} = \frac{\sum_{t=1}^{N}\langle m_{jt}\rangle h_{ts}}{\sum_{l=1}^{D}\sum_{u=1}^{N}\langle m_{lu}\rangle h_{us}}\,, \qquad (7.13)
\]
and average assignment variables
\[
\langle m_{kr}\rangle = \frac{\exp(-\beta e_{kr})}{\sum_{s=1}^{N}\exp(-\beta e_{ks})}\,. \qquad (7.14)
\]
Please note the similarities of the last three equations (7.12), (7.13), and (7.14) with the corresponding equations for STMK in Chapter 6, equations (6.18), (6.17), and (6.14), respectively. The E step of the two algorithms, (7.14) and (6.14), has exactly the same form, and so do the weighting coefficients $b_{ir}$ (7.13) and $a_{ir}$ (6.17). Noting that the average cluster assignments (7.14) and (6.14) are invariant under the transformation $e_{ir} \rightarrow e_{ir} + c_i$, $\forall c_i \in \Re$, it can be seen that the equations for the mean-fields in STMP (7.12) and those for the partial assignment costs in STMK (6.18) are equivalent for $d_{ij} = -k(\mathbf{x}_i,\mathbf{x}_j)$. In this sense, STMK and STMP are equivalent if the dissimilarity value between data items corresponds to a negative inner product between feature representations of the data items. On the one hand, this result is surprising, because STMP and STMK were derived from quite distinct ideas: a mean-field approximation for pairwise dissimilarities in the case of STMP, and clustering in high dimensional feature spaces using inner products and the kernel trick in the case of STMK. However, inner products are nothing but similarity measures in Euclidean spaces, and STMK can in this sense be thought of as a special case of STMP.
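The coupled equations (7.12)-(7.14) translate directly into matrix operations on the dissimilarity matrix. The following sketch performs one such mean-field update; it is illustrative only, the array and function names are hypothetical, and the shift for numerical stability is an implementation detail rather than part of the derivation.

import numpy as np

def stmp_update(D_mat, H, E, beta):
    """One mean-field iteration of STMP: D_mat[i, j] = d_ij, H[r, s] = h_rs,
    E[i, r] = current mean-fields e_ir."""
    # Eq. (7.14): average assignments <m_ir> (E step)
    logits = -beta * E
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    M = np.exp(logits)
    M /= M.sum(axis=1, keepdims=True)
    # Eq. (7.13): weighting coefficients b_is
    B = M @ H                                     # sum_t <m_it> h_ts
    B /= B.sum(axis=0, keepdims=True)
    # Eq. (7.12): new mean-fields (M step)
    inner = D_mat @ B - 0.5 * np.sum(B * (D_mat @ B), axis=0)[None, :]
    E_new = inner @ H.T                           # convolve with h_rs
    return M, B, E_new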


7.2.2 EM and Deterministic Annealing

The mean-field approximation to the free energy leads to an efficient EM scheme for the computation of the assignment probabilities $\langle m_{ir}\rangle$ at a given value of $\beta$. In the E step (7.14) the assignment probabilities $\langle m_{ir}\rangle$ are estimated based on the previous estimate of the partial assignment costs $e_{kr}$. In the M step (7.12) the partial assignment costs $e_{kr}$ are recalculated in terms of the coefficients $b_{ir}$, which are obtained from the previous estimate of the assignment probabilities $\langle m_{ir}\rangle$ in (7.13). In order to avoid convergence to local minima, deterministic annealing in the parameter $\beta$ is employed as discussed in Chapter 3, and thus one obtains the soft topographic mapping for proximity data (STMP). The following diagram summarizes the STMP algorithm in pseudo-code:

Soft Topographic Mapping for Proximity Data (STMP)

  initialize
    e_ir <- n_ir, for all i, r, with n_ir in [0,1] a random number
    calculate lookup table for h_rs
    prepare dissimilarity matrix d_ij from data
    choose beta_start, beta_final, annealing factor eta, and convergence criterion epsilon
    beta <- beta_start
  while beta < beta_final                         (Annealing)
    repeat                                        (EM)
      E step: calculate <m_ir>, for all i, r, using Eq. (7.14)
      M step: calculate b_ir^new, for all i, r, using Eq. (7.13)
              calculate e_ir^new, for all i, r, using Eq. (7.12)
    until |e_ir^new - e_ir^old| < epsilon, for all i, r
    beta <- eta * beta
  end

7.2.3 SOM Approximation for STMP

In Chapter 4 a family of clustering algorithms was derived, among them the SSOM, which is an efficient approximation to STVQ. Let us now introduce an equivalent approximation to STMP. The E step, equation (7.14), can be seen as a soft-min function w.r.t. the mean-fields $e_{kr}$. Leaving out the convolution with $h_{rs}$ thus leads to a new prescription for the calculation of the mean-fields,
\[
e_{ks} = \sum_{j=1}^{D} b_{js}\left(d_{kj} - \frac{1}{2}\sum_{i=1}^{D} b_{is}\, d_{ij}\right)\,, \qquad (7.15)
\]
the SOM approximation. This approximation is computationally more efficient than the exact update given in (7.12). It has the drawback, however, that the iteration scheme (7.14) and (7.15) no longer performs an exact minimization of the free energy. Nevertheless, the robustness of the SOM algorithm, which is based on the same approximation [18], and numerical results demonstrate the usefulness of the approximation.

7.3 Critical Temperature of the First Phase Transition

As has been discussed in Chapter 5 for STVQ and in Chapter 6 for STMK, the annealing in $\beta$ induces changes in the cluster representation. Cluster centers in data space split along the principal axis of the data according to the eigen-structure of the coupling matrix $\mathbf{H}$ [42].


Although there exists no Euclidean data space in the dissimilarity approach, the critical behavior of the cluster assignments with decreasing temperature can be examined.

Let us consider the case of infinite temperature, $\beta = 0$. Using $\langle m_{kr}\rangle_0 = 1/N$ in equation (7.14), the mean-fields $e^0_{kr}$ for $\beta = 0$ are given by
\[
e^0_{kr} = \frac{1}{D}\sum_{j=1}^{D}\left(d_{kj} - \frac{1}{2D}\sum_{i=1}^{D} d_{ij}\right)\,. \qquad (7.16)
\]
Then linearize the right-hand side of equation (7.12) around $e^0_{kr}$ by performing a Taylor expansion in $e_{mv} - e^0_{mv}$:
\[
e_{kr} - e^0_{kr} = \sum_{m=1}^{D}\sum_{v=1}^{N}\left.\frac{\partial e_{kr}}{\partial e_{mv}}\right|_{e^0_{kr}}\left(e_{mv} - e^0_{mv}\right) + \cdots \qquad (7.17)
\]
Evaluation of this expression yields
\[
e_{kr} - e^0_{kr} = \beta\sum_{m=1}^{D}\sum_{v=1}^{N}\tau_{km}\,\gamma_{rv}\left(e_{mv} - e^0_{mv}\right) \qquad (7.18)
\]
with
\[
\tau_{km} = \frac{1}{D}\left(\frac{1}{D}\sum_{i=1}^{D} d_{im} + \frac{1}{D}\sum_{j=1}^{D} d_{kj} - d_{km} - \frac{1}{D^2}\sum_{i,j=1}^{D} d_{ij}\right) \qquad (7.19)
\]
and
\[
\gamma_{rv} = \sum_{s} h_{rs}\left(h_{vs} - \frac{1}{N}\right)\,. \qquad (7.20)
\]
Equations (7.18) can be decoupled by transforming the shifted mean-fields $e_{kr} - e^0_{kr}$ into the eigen-bases of $\boldsymbol{\tau}$ and $\boldsymbol{\gamma}$. Denoting the transformed mean-fields $\tilde{e}_{\mu\nu}$, one arrives at
\[
\tilde{e}_{\mu\nu} = \beta\,\lambda^{\tau}_{\mu}\,\lambda^{\gamma}_{\nu}\,\tilde{e}_{\mu\nu}\,. \qquad (7.21)
\]
Assuming $h_{rs} = h_{sr}$, this equation has non-vanishing solutions only for $\beta\,\lambda^{\tau}_{\mu}\,\lambda^{\gamma}_{\nu} = 1$, where $\lambda^{\tau}_{\mu}$ and $\lambda^{\gamma}_{\nu}$ are eigenvalues of $\boldsymbol{\tau}$ and $\boldsymbol{\gamma}$, respectively. This means that the fixed-point state from equation (7.16) first becomes unstable during the increase of $\beta$ at
\[
\beta^* = \frac{1}{\lambda^{\tau}_{\max}\,\lambda^{\gamma}_{\max}}\,, \qquad (7.22)
\]
where $\lambda^{\tau}_{\max}$ and $\lambda^{\gamma}_{\max}$ denote the largest eigenvalues of $\boldsymbol{\tau}$ and $\boldsymbol{\gamma}$, respectively. The instability is, of course, characterized by the corresponding eigenvectors $\mathbf{v}^{\tau}_{\max}$ and $\mathbf{v}^{\gamma}_{\max}$. While $\mathbf{v}^{\gamma}_{\max}$ determines the mode in index space which first becomes unstable (see Chapter 5 for details), $\mathbf{v}^{\tau}_{\max}$ can be identified as the first principal coordinate from classical metric multidimensional scaling [40, 12]. It is instructive to consider a special case of $\boldsymbol{\tau}$ to understand its meaning. Assume that the dissimilarity matrix $\mathbf{D}$ represents the squared Euclidean distances $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|^2/2$ of $D$ data vectors $\mathbf{x}$ with zero mean in an $s$-dimensional Euclidean space. Then it is easy to show from (7.19) that
\[
\tau_{km} = \frac{1}{D}\,\mathbf{x}_k\cdot\mathbf{x}_m\,. \qquad (7.23)
\]
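Numerically, the critical value (7.22) follows directly from the dissimilarity and coupling matrices. The sketch below (illustrative only, with hypothetical names) builds the matrices of (7.19) and (7.20); as in the text, a symmetric coupling matrix H is assumed for the validity of the eigenvalue condition.

import numpy as np

def stmp_critical_beta(D_mat, H):
    """Critical inverse temperature (7.22) from the dissimilarity matrix and
    the coupling matrix, using the matrices defined in (7.19) and (7.20)."""
    D = D_mat.shape[0]
    N = H.shape[0]
    row = D_mat.mean(axis=1, keepdims=True)       # (1/D) sum_j d_kj
    col = D_mat.mean(axis=0, keepdims=True)       # (1/D) sum_i d_im
    tau = (row + col - D_mat - D_mat.mean()) / D  # Eq. (7.19)
    gamma = H @ (H.T - 1.0 / N)                   # Eq. (7.20)
    lam_tau = np.linalg.eigvalsh((tau + tau.T) / 2).max()
    lam_gamma = np.linalg.eigvals(gamma).real.max()
    return 1.0 / (lam_tau * lam_gamma)            # Eq. (7.22)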


In this Euclidean case the $D\times D$ matrix $\boldsymbol{\tau}$ can at maximum have rank $s$. From singular value decomposition it can be seen that the non-zero eigenvalues of $\boldsymbol{\tau}$ are the same as those of the covariance matrix $\mathbf{C}$ of the data. Since the eigenvectors of $\mathbf{C}$ correspond to the principal axes in data space, and its eigenvalues are the associated variances, the maximum variance in data space determines the critical temperature, whereby the instability occurs along the principal axis. In the general case of dissimilarities it can be concluded that $\boldsymbol{\tau}$ determines the width or pseudo-variance of the ensemble of dissimilarity items.

The results of this section can be extended to the STMP with SOM approximation. The matrix $\boldsymbol{\gamma}$ as given in (7.20) is modified by omitting one convolution with $h_{rs}$, resulting in
\[
\gamma^S_{rv} = \left(h_{vr} - \frac{1}{N}\right)\,. \qquad (7.24)
\]

7.4 Numerical Simulations

7.4.1 Toy Example: Noisy Spiral

In this section STMP is used to generate a topographic representation of a one-dimensional noisy spiral (Figure 7.1, left) in a three-dimensional Euclidean space using distance data only. 100 data points $\mathbf{x}$ were generated via
\[
x = \sin(\theta) + n_x\,, \qquad y = \cos(\theta) + n_y\,, \qquad z = \theta/\pi + n_z\,, \qquad (7.25)
\]
where $\theta \in [0, 4\pi]$ and $\mathbf{n}$ is Gaussian noise with zero mean and standard deviation $\sigma_n = 0.3$.
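A sketch of the data generation (7.25) and of the dissimilarity matrix used as the only input to STMP is given below. It is illustrative only; in particular the theta/pi scaling of the z-coordinate is an assumption recovered from the plotted axis range, and the random seed is arbitrary.

import numpy as np

rng = np.random.default_rng(0)
theta = np.linspace(0.0, 4.0 * np.pi, 100)
noise = 0.3 * rng.standard_normal((3, 100))
# noisy spiral, cf. (7.25); the theta/pi scaling of z is an assumption
X = np.stack([np.sin(theta) + noise[0],
              np.cos(theta) + noise[1],
              theta / np.pi + noise[2]], axis=1)
# pairwise dissimilarities d_ij = ||x_i - x_j||^2 / 2
sq = np.sum(X**2, axis=1)
D_mat = 0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T)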


Figure 7.1: Plots of a noisy spiral (left) and the corresponding dissimilarity matrix (right). 100 data points were generated according to (7.25) with $\theta \in [0, 4\pi]$ and $\sigma_n = 0.3$. The dissimilarity matrix was obtained as $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|^2/2$ and was plotted such that the rows from top down and the columns from left to right correspond to the data points in order of their generation along the spiral with increasing $\theta$.


The dissimilarity matrix $\mathbf{D}$ was calculated from the squared Euclidean distances between the data points, $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|^2/2$, and is depicted in Figure 7.1 (right). The neighborhood matrix $\mathbf{H}$ was chosen such that it reflects the topology of a chain of $N = 10$ clusters, with the coupling strength decreasing as a Gaussian function of distance,
\[
h_{\mathbf{rs}} = c\,\exp\left(-\|\mathbf{r}-\mathbf{s}\|^2/2\sigma_h^2\right)\,, \qquad (7.26)
\]
with $\sigma_h = 0.5$ and $h_{\mathbf{rs}}$ normalized to unit probability over all clusters by $c$. Note that this choice of $\sigma_h$ corresponds to a very narrow neighborhood. Without annealing it would lead to topological defects in the representation. STMP was applied both with and without the SOM approximation, choosing an exponential annealing schedule [42] with $\eta = 1.1$ and $\beta_{\mathrm{start}} = 0.1$.

−110

010

110

20

20

40

60

80

100

120

140

β

<E

>

exact TMP updatewith SOM approx.

Figure 7.2: Plot of the average assignment cost $U = \langle E\rangle$ as a function of the temperature parameter $\beta$ for STMP with and without SOM approximation applied to the dissimilarity matrix of the noisy spiral from Figure 7.1. The topology of the clusters was that of a chain given by equation (7.26) with $\sigma_h = 0.5$. The annealing schedule was exponential with $\eta = 1.1$ and the convergence criterion for the EM algorithm was $\varepsilon = 10^{-7}$. The average cost $U$ was calculated using equation (7.1) with the binary assignment variables $m_{ir}$ replaced by their averages $\langle m_{ir}\rangle$. The vertical line indicates the value $\beta^* = 0.71$ as calculated from equation (7.22).


As can be seen from Figure 7.2, both variants of STMP converge to the same final value of the average cost function at low temperature. The first split occurs close to $\beta^* = 0.71$, the value predicted by (7.22) and indicated as a vertical line in the plot. Due to the weak coupling, the SOM approximation induces only a slightly earlier transition at $\beta^*_S = 0.70$, in accordance with (7.24). STMP detects the reduced dimensionality of the spiral and correctly forms groups along the spiral. Figure 7.3 shows the assignment matrix of the data points, in order of their generation along the spiral, to the chain of clusters. At high temperature (left) the assignments are fuzzy, but the emerging topography is visible immediately after the phase transition. The diagonal structure of the assignment matrix at low temperature (right) indicates the topography of the map, while the small defects stem from the Gaussian noise on the data.

da

ta p

oin

ts

2 4 6 8 10

10

20

30

40

50

60

70

80

90

100

neurons

da

ta p

oin

ts

2 4 6 8 10

10

20

30

40

50

60

70

80

90

Figure 7.3: Plot of the average assignments $\langle m_{ir}\rangle$ at high temperature, $\beta = 0.81$ (left), and low temperature, $\beta = 186.21$ (right), for STMP without SOM approximation applied to the noisy spiral. Dark corresponds to high probability of assignment. Data and parameters as in Figure 7.2.

7.4.2 Topographic Map of the Cat's Cerebral Cortex

Let us now consider an example which cannot in any sense be interpreted as representing a Euclidean space. The input data consist of a matrix of connection strengths between cortical areas of the cat. The data were collected by Scannell et al. [108] from text and figures of the available anatomical literature, and the connections are assigned dissimilarity values $d$ as follows: self-connection ($d = 0$), strong and dense connection ($d = 1$), intermediate connection ($d = 2$), weak connection ($d = 3$), and absent or unreported connection ($d = 4$). While these data were originally analysed as ordinal data [108], here the stronger assumption is made that the dissimilarity values represent a ratio scale. Since the "true" values of the connection strength are not known, this is a very crude approximation. However, it serves well for demonstration purposes and shows the robustness of the described method. While the original matrix $d'_{ij}$ was not completely symmetrical due to differences between afferent and efferent connections, the application of STMP is equivalent to the substitution $d_{ij} = (d'_{ij} + d'_{ji})/2$. Since the original matrix was nearly symmetrical, this introduces only a small mean square deviation per dissimilarity from the true matrix ($\frac{1}{D^2}\sum_{i,j}(d_{ij} - d'_{ij})^2 \approx 0.1$). The topology was chosen as a two-dimensional map of clusters coupled in accordance with equation (7.26), with two-dimensional index vectors and $\sigma_h = 0.4$. Figure 7.4 shows the dissimilarity matrix with columns and rows sorted according to the STMP assignment results.


Figure 7.4: Dissimilarity matrix of areas of the cat's cerebral cortex. The areas are sorted according to their cluster assignments from top down and from left to right. The horizontal and vertical lines show groups of areas as assigned to clusters. Dark means similar. The topology of the clusters was that of a $5\times 5$ lattice as given by (7.26) with $\sigma_h = 0.4$. The annealing scheme was exponential with $\eta = 1.05$ and $\beta_{\mathrm{start}} = 2.5 < 2.7265 = \beta^*$. The convergence criterion for the EM algorithm was $\varepsilon = 10^{-10}$.


The dominant block diagonal structure reflects the fact that areas assigned to the same cluster are very similar. Additionally, it can be seen that areas which are assigned to clusters far apart in the lattice are less similar to each other than those assigned to neighboring clusters. Figure 7.5 displays the areas as assigned to clusters on the map by STMP. Four coherent regions on the map can be seen to represent four cortical systems: visual, auditory, somatosensory, and frontolimbic. The visual areas 20b and PS are an exception and occupy a cluster which is not part of the main visual region. Their position is justified, however, by the fact that these areas have many connections to the frontolimbic system. In general it is observed that primary areas, such as areas 17 and 18 for the visual system, areas 1, 2, 3a, 3b, and SII for the somatosensory system, and areas AI and AII for the auditory system, are placed at corners or at an edge of the map. Higher areas with more crosstalk are found more centrally located on the map. An example is EPp, the posterior part of the posterior ectosylvian gyrus, a visual and auditory association area [108] represented at the very center of the map with two direct visual neighbors. For comparison, Figure 7.6 shows the solution of metric multidimensional scaling [40, 12] on the same dissimilarity matrix. Although more of the local dissimilarity structure is preserved in the two-dimensional embedding, the regions are not clearly separated. In summary, the map in Figure 7.5 is a plausible visualization of the connection patterns found in the cat's cerebral cortex. It is clear, however, that the rather arbitrary and coarse topography of the $5\times 5$ square map cannot fully express the rich cortical structures. Prior knowledge about the connection patterns, if available, could be encoded in the topology of the clusters to improve the representation.

7.4.3 Conclusion on STMP

The STMP algorithm introduced in this chapter generalizes topographic clustering to arbitrary distance measures in a mean-field approximation. The distance measure $d_{ij}$ in the data set as well as the coupling $h_{rs}$ on the set of clusters can be chosen freely, and it is this flexibility that opens up many applications in empirical sciences like psychology, economics, or biochemistry.

More generally, the characterization of a set of data items by their pairwise dissimilarities is a very natural representation for discrimination and clustering tasks that makes fewer assumptions than, e.g., a vector space representation [38]. In particular, structured objects, for which a feature representation is difficult to define, can often be characterized by their pairwise dissimilarities and can thus be analysed using STMP.

Finally, the similarity between STMP and STMK hints at the close relation between pairwise dissimilarities and vector space representations with inner products. STMP acts on dissimilarity data directly and thus avoids the preprocessing step of embedding the data in a Euclidean space. This procedure can also be applied to other problems like classification and regression, and this is a topic of ongoing research.


Figure 7.5: Connection map of the cat's cerebral cortex. The map shows 65 cortical areas mapped to a lattice of $5\times 5$ clusters. The four cortical systems, frontolimbic (solid), visual (long-dashed), auditory (dotted), and somatosensory (dashed), have been mapped to coherent regions except for the visual areas 20b and PS, which occupy a cluster apart from the main visual region. Parameters as in Figure 7.4.


Figure 7.6: Metric multidimensional scaling solution for the dissimilarity matrix of the cat's cerebral cortex. The plot shows the embedding of 65 cortical areas in two dimensions. The system to which an area belongs can be identified from the first letter of its label: frontolimbic (F), visual (V), auditory (A), and somatosensory (S). The embedding is achieved by determining the eigenvectors of the matrix $\boldsymbol{\tau}$ (7.19) and by projecting the column vectors of $\boldsymbol{\tau}$, each corresponding to a data item, onto the eigenvectors with the largest associated eigenvalues. Shown is a scatter plot of the projections onto the two eigenvectors associated with the two largest eigenvalues, scaled by the square root of the respective eigenvalue.
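The embedding procedure described in this caption amounts to classical metric MDS on the double-centered matrix of (7.19). The following sketch is an illustration only, with hypothetical names, under the convention that coordinates are the leading eigenvectors scaled by the square root of their eigenvalues.

import numpy as np

def classical_mds(D_mat, dim=2):
    """Low-dimensional embedding from a dissimilarity matrix via the
    double-centered matrix of (7.19), cf. Figure 7.6."""
    D = D_mat.shape[0]
    row = D_mat.mean(axis=1, keepdims=True)
    col = D_mat.mean(axis=0, keepdims=True)
    tau = (row + col - D_mat - D_mat.mean()) / D        # Eq. (7.19)
    eigval, eigvec = np.linalg.eigh((tau + tau.T) / 2)  # ascending order
    idx = np.argsort(eigval)[::-1][:dim]                # largest eigenvalues first
    # leading eigenvectors, scaled by the square root of their eigenvalues
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))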


Chapter 8

Conclusion

8.1 Conclusion

The central theme of this work was topographic clustering. Starting from the idea of a probabilistic autoencoder, two cost functions for clustering were derived, one for Euclidean data vectors and one for arbitrary dissimilarity data. The multimodality of these cost functions made it necessary to find an appropriate optimization scheme that strikes a balance between optimality and computational costs. For soft topographic vector quantization (STVQ), deterministic annealing, which is based on principles from statistical physics, was introduced and combined with the expectation maximization algorithm, which is a well known tool from statistics for finding maximum likelihood solutions in problems with missing data.

The variation of the temperature parameter during optimization leads to phase transitions in the cluster representation. This gave rise to a detailed analysis of the annealing process and of the phase transitions involved. The phase transitions were characterized in terms of properties of the data as well as of the preassigned cluster structure.

In order to perform topographic clustering in high dimensional feature spaces, the kernel trick was applied, which makes it possible to calculate inner products in feature space by calculating the value of a kernel function in data space. This method makes it possible to take into account higher order correlations between the input features, i.e. to change the distance measure used in data space, while retaining the advantages of deterministic annealing and EM for the kernel-based soft topographic mapping (STMK).

Data may be represented not by Euclidean feature vectors but by pairwise dissimilarities between data items. This more general approach was taken to develop a topographic mapping for proximity data based on deterministic annealing (STMP). Interestingly, this latter approach, which was based on a mean-field approximation, turned out to incorporate the kernel-based algorithm STMK as a special case, when the dissimilarities were chosen as negative inner products between data vectors in a Euclidean space.

In summary, ideas from statistical physics, statistics, and neuroinformatics were combined to create new ways of clustering data for unsupervised data analysis and visualization as well as for data compression.

8.2 Future Work

A plethora of new ideas has grown out of this work. One starting point is the autoencoder concept, which opens up many other possibilities besides its role in the present work. More complex models could be used for the probabilistic encoders and could lead to a more expressive representation of the data.


Autoencoders also form a conceptual bridge between unsupervised and supervised learning. This relation could be used to transfer ideas about generalization and learning dynamics from the realm of supervised learning, where these topics are well established, to unsupervised learning.

Also, the deterministic annealing scheme for optimization, which has proved to be efficient in the clustering context, could be used for other nonlinear optimization problems, e.g. those found in supervised learning. This idea has been tried by [88] for different architectures, but since nonlinear optimization is such a widespread problem, it should be worthwhile to examine the applicability of deterministic annealing systematically.

Another set of ideas is motivated by the fact that for pairwise dissimilarity data the dissimilarity matrix grows quadratically with the number of data items. In order to keep the problem computationally tractable, one must incorporate mechanisms that deal with missing data. Obviously, this always involves assumptions about the data that should be compatible with the problem at hand [122]. Another, possibly more efficient way of dealing with the large amount of data is active data selection, i.e. querying those data which are most likely to promote the learning process. Different criteria [80, 55] have been suggested to this end, and this is certainly an interesting direction for further research.

The general idea of describing a set of data items in terms of their mutual dissimilarities, without direct reference to their original representation, can be used in supervised learning problems like classification or regression as well. Assumptions or prior knowledge about the problem can then be incorporated in the distance measure used, thus avoiding implicit assumptions introduced by, e.g., vector space representations. Currently, this idea is actively pursued and results are under way. Also, the standard classification paradigm, in which equivalence relations are learned, can be complemented by other learning objectives. An example is the learning of preference relations, which we recently investigated for the purpose of information retrieval [49].


Chapter 9

Appendix

9.1 Proof of the Fixed-Point Property

A set of cluster centers $\{\mathbf{w}_{0\mathbf{r}}\}$ qualifies as a fixed-point of (5.8) if it satisfies
\[
\mathbf{w}_{0\mathbf{r}} = \frac{\int P(\mathbf{x})\,\mathbf{x}\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}}{\int P(\mathbf{x})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}}\,, \quad \forall \mathbf{r}\,, \qquad (9.1)
\]
where
\[
P^0(\mathbf{x}\in C_{\mathbf{s}}) = \frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{st}}\,\|\mathbf{x}-\mathbf{w}_{0\mathbf{t}}\|^2\right)}{\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{ut}}\,\|\mathbf{x}-\mathbf{w}_{0\mathbf{t}}\|^2\right)}\,. \qquad (9.2)
\]
Let us first consider the transversal dimensions. Inserting (5.10) into (9.1) yields
\[
\mathbf{w}^{\perp}_{0\mathbf{r}} = \frac{\int P(\mathbf{x})\,\mathbf{x}^{\perp}\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}}{\int P(\mathbf{x})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}} \stackrel{!}{=} \mathbf{0}^{\perp}\,, \quad \forall \mathbf{r}\,. \qquad (9.3)
\]
Using $P^0(\mathbf{x}\in C_{\mathbf{s}}) = P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})$ one obtains for the numerator of (9.3)
\[
\int P(\mathbf{x})\,\mathbf{x}^{\perp}\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})\, d\mathbf{x} = \int P(\mathbf{x}^{\parallel})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})\, d\mathbf{x}^{\parallel}\int P(\mathbf{x}^{\perp})\,\mathbf{x}^{\perp}\, d\mathbf{x}^{\perp} = \mathbf{0}^{\perp}\,, \qquad (9.4)
\]
because the mean of $P(\mathbf{x}^{\perp})$ was assumed to be zero. Hence (9.3) is satisfied.

For the evaluation of the longitudinal dimensions, again insert (5.10) into (9.1) to obtain
\[
\mathbf{w}^{\parallel}_{0\mathbf{r}} = \frac{\int P(\mathbf{x})\,\mathbf{x}^{\parallel}\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}}{\int P(\mathbf{x})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}\in C_{\mathbf{s}})\, d\mathbf{x}} \stackrel{!}{=} \rho^{-1}\mathbf{r}\,, \quad \forall \mathbf{r}\,. \qquad (9.5)
\]
The left-hand side of equation (9.5) can be written as an average $\int Q_{\mathbf{r}}(\mathbf{x}^{\parallel})\,\mathbf{x}^{\parallel}\, d\mathbf{x}^{\parallel}$ over a probability distribution $Q_{\mathbf{r}}(\mathbf{x}^{\parallel})$ given by
\[
Q_{\mathbf{r}}(\mathbf{x}^{\parallel}) = \frac{P(\mathbf{x}^{\parallel})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})}{\int P(\mathbf{x}^{\parallel})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})\, d\mathbf{x}^{\parallel}} = N^n\, P(\mathbf{x}^{\parallel})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})\,, \qquad (9.6)
\]
where in the second step the identity
\[
\int P(\mathbf{x}^{\parallel})\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})\, d\mathbf{x}^{\parallel} = \frac{1}{N^n} \qquad (9.7)
\]
has been used.


Equation (9.7) can be shown by summing both sides over $\mathbf{r}$, yielding unity. To demonstrate the validity of (9.5) one only needs to show that $Q_{\mathbf{r}}(\mathbf{x}^{\parallel})$ is symmetrical w.r.t. $\mathbf{w}^{\parallel}_{0\mathbf{r}} = \rho^{-1}\mathbf{r}$. Since $P(\mathbf{x}^{\parallel})$ is homogeneous, it is sufficient to show that $\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}})$ is symmetrical w.r.t. $\rho^{-1}\mathbf{r}$. This is equivalent to
\[
\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}}) = \sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(2\rho^{-1}\mathbf{r}-\mathbf{x}^{\parallel}\in C_{\mathbf{s}}) = \sum_{\mathbf{s}} h_{\mathbf{rs}}\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{st}}\,\|\mathbf{x}^{\parallel}-\rho^{-1}(2\mathbf{r}-\mathbf{t})\|^2\right)}{\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{ut}}\,\|\mathbf{x}^{\parallel}-\rho^{-1}(2\mathbf{r}-\mathbf{t})\|^2\right)}\,. \qquad (9.8)
\]
From $h_{\mathbf{rs}} = h_{\|\mathbf{r}-\mathbf{s}\|}$ it follows that $h_{\mathbf{rs}} = h_{\mathbf{r}(2\mathbf{r}-\mathbf{s})}$. Substituting $\mathbf{s}\rightarrow\mathbf{s}' = 2\mathbf{r}-\mathbf{s}$ and $\mathbf{t}\rightarrow\mathbf{t}' = 2\mathbf{r}-\mathbf{t}$ one can write
\[
\begin{aligned}
\sum_{\mathbf{s}} h_{\mathbf{rs}}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}}) &= \sum_{\mathbf{s}'} h_{\mathbf{rs}'}\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{(2\mathbf{r}-\mathbf{s}')\mathbf{t}}\,\|\mathbf{x}^{\parallel}-\rho^{-1}(2\mathbf{r}-\mathbf{t})\|^2\right)}{\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{ut}}\,\|\mathbf{x}^{\parallel}-\rho^{-1}(2\mathbf{r}-\mathbf{t})\|^2\right)} \\
&= \sum_{\mathbf{s}'} h_{\mathbf{rs}'}\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'} h_{\mathbf{s}'\mathbf{t}'}\,\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{t}'\|^2\right)}{\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'} h_{\mathbf{ut}'}\,\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{t}'\|^2\right)} = \sum_{\mathbf{s}'} h_{\mathbf{rs}'}\, P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{s}'})\,. \qquad (9.9)
\end{aligned}
\]
Thus the probability distribution $Q_{\mathbf{r}}(\mathbf{x}^{\parallel})$ is symmetrical w.r.t. $\mathbf{w}^{\parallel}_{0\mathbf{r}} = \rho^{-1}\mathbf{r}$ and consequently $\int Q_{\mathbf{r}}(\mathbf{x}^{\parallel})\,\mathbf{x}^{\parallel}\, d\mathbf{x}^{\parallel} = \rho^{-1}\mathbf{r}$. Hence (9.5) is correct and (5.10) is a fixed-point of (5.8).

9.2 Derivation of the Symmetry Properties of the Assignment Correlations

Here it is shown that $f_{\mathbf{rs}} = f_{\|\mathbf{r}-\mathbf{s}\|}$ follows from $h_{\mathbf{rs}} = h_{\|\mathbf{r}-\mathbf{s}\|}$. Starting from (5.13) one can express $f_{\mathbf{rs}}$ as
\[
f_{\mathbf{rs}} = N^n \int P(\mathbf{x}^{\parallel})\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}}\left(h_{\|\mathbf{r}-\mathbf{t}\|} + h_{\|\mathbf{s}-\mathbf{t}\|}\right)\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{t}\|^2\right)}{\left(\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\|\mathbf{u}-\mathbf{t}\|}\,\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{t}\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel}\,. \qquad (9.10)
\]
Substituting $\mathbf{t}\rightarrow\mathbf{t}' = \mathbf{A}(\mathbf{t}-\mathbf{s})$, where $\mathbf{A}$ is any non-singular, length-preserving transformation matrix, and using $\|\mathbf{A}\mathbf{r}\| = \|\mathbf{r}\|$, $\forall\mathbf{r}$, one obtains
\[
f_{\mathbf{rs}} = N^n \int P(\mathbf{x}^{\parallel})\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'}\left(h_{\|\mathbf{A}(\mathbf{r}-\mathbf{s})-\mathbf{t}'\|} + h_{\|\mathbf{t}'\|}\right)\|\mathbf{x}^{\parallel}-\rho^{-1}(\mathbf{A}^{-1}\mathbf{t}'+\mathbf{s})\|^2\right)}{\left(\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'} h_{\|\mathbf{u}-\mathbf{A}^{-1}\mathbf{t}'-\mathbf{s}\|}\,\|\mathbf{x}^{\parallel}-\rho^{-1}(\mathbf{A}^{-1}\mathbf{t}'+\mathbf{s})\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel}\,. \qquad (9.11)
\]


Substituting $\mathbf{x}^{\parallel}\rightarrow\mathbf{x}^{\parallel\prime} = \mathbf{A}(\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{s})$ and $\mathbf{u}\rightarrow\mathbf{u}' = \mathbf{A}(\mathbf{u}-\mathbf{s})$ leads to
\[
\begin{aligned}
f_{\mathbf{rs}} &= N^n \int P(\mathbf{A}^{-1}\mathbf{x}^{\parallel\prime}+\rho^{-1}\mathbf{s})\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'}\left(h_{\|\mathbf{A}(\mathbf{r}-\mathbf{s})-\mathbf{t}'\|} + h_{\|\mathbf{t}'\|}\right)\|\mathbf{x}^{\parallel\prime}-\rho^{-1}\mathbf{t}'\|^2\right)}{\left(\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'} h_{\|\mathbf{u}-\mathbf{A}^{-1}\mathbf{t}'-\mathbf{s}\|}\,\|\mathbf{x}^{\parallel\prime}-\rho^{-1}\mathbf{t}'\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel\prime} \\
&= N^n \int P(\mathbf{A}^{-1}\mathbf{x}^{\parallel\prime}+\rho^{-1}\mathbf{s})\,\frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'}\left(h_{\|\mathbf{A}(\mathbf{r}-\mathbf{s})-\mathbf{t}'\|} + h_{\|\mathbf{t}'\|}\right)\|\mathbf{x}^{\parallel\prime}-\rho^{-1}\mathbf{t}'\|^2\right)}{\left(\sum_{\mathbf{u}'}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}'} h_{\|\mathbf{u}'-\mathbf{t}'\|}\,\|\mathbf{x}^{\parallel\prime}-\rho^{-1}\mathbf{t}'\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel\prime}\,. \qquad (9.12)
\end{aligned}
\]
Comparing (9.12) to (9.10) it can be seen that $f_{\mathbf{rs}}$ is a function of $\mathbf{A}(\mathbf{r}-\mathbf{s})$ if $P(\mathbf{A}^{-1}\mathbf{x}^{\parallel}+\rho^{-1}\mathbf{s}) = P(\mathbf{x}^{\parallel})$. This is the case for the particular choice $P(\mathbf{x}^{\parallel}) = \ell^{-n}$, and since $\mathbf{A}$ can be any length-preserving linear transformation, it follows that $f_{\mathbf{rs}} = f_{\|\mathbf{r}-\mathbf{s}\|}$.

9.3 Evaluation of the Assignment Correlation for Gaussian Neighborhood Functions

Starting from (5.13) it is shown how to calculate the approximation of $f_{\mathbf{rs}}$ as given in (5.21) for the homogeneous isotropic Gaussian neighborhood function given in (5.20) in the continuum approximation. Inserting the assignment probabilities $P^0(\mathbf{x}^{\parallel}\in C_{\mathbf{r}})$, (5.9), for the ground state (5.10) into (5.13) gives
\[
f_{\mathbf{rs}} = \rho^n \int \frac{\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}}\left(h_{\mathbf{rt}} + h_{\mathbf{st}}\right)\|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{0\mathbf{t}}\|^2\right)}{\left(\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\sum_{\mathbf{t}} h_{\mathbf{ut}}\,\|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{0\mathbf{t}}\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel}\,. \qquad (9.13)
\]
First, let us evaluate the expression $\sum_{\mathbf{t}} h_{\mathbf{rt}}\|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{\mathbf{t}}\|^2$ in the continuum approximation, with sums replaced by integrals, by using the property of the fixed-point $\mathbf{w}^{\parallel}_{0\mathbf{t}} = \rho^{-1}\mathbf{t}$ from (5.10). This gives
\[
\sum_{\mathbf{t}} h_{\mathbf{rt}}\,\|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{\mathbf{t}}\|^2 \approx \left(\frac{1}{\sqrt{2\pi}\,\sigma_h}\right)^n\int\exp\left(-\frac{\|\mathbf{r}-\mathbf{t}\|^2}{2\sigma_h^2}\right)\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{t}\|^2\, d\mathbf{t}
= \left(\frac{1}{\sqrt{2\pi}\,\sigma_h}\right)^n\int\exp\left(-\frac{\|\mathbf{t}'\|^2}{2\sigma_h^2}\right)\|\mathbf{x}^{\parallel}-\rho^{-1}(\mathbf{t}'+\mathbf{r})\|^2\, d\mathbf{t}'\,, \qquad (9.14)
\]
for which $\mathbf{t}\rightarrow\mathbf{t}' = \mathbf{t}-\mathbf{r}$. Now the evaluation of the integral is straightforward and yields
\[
\sum_{\mathbf{t}} h_{\mathbf{rt}}\,\|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{\mathbf{t}}\|^2 \approx \|\mathbf{x}^{\parallel}-\mathbf{w}^{\parallel}_{\mathbf{r}}\|^2 + n\,\rho^{-2}\sigma_h^2\,. \qquad (9.15)
\]
Inserting this into (9.13) and observing that the expression $\exp\left(-\beta\, n\,\rho^{-2}\sigma_h^2\right)$ appears as a factor in numerator and denominator and thus cancels, one arrives at
\[
f_{\mathbf{rs}} \approx \rho^n \int \frac{\exp\left(-\frac{\beta}{2}\left(\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{r}\|^2 + \|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{s}\|^2\right)\right)}{\left(\sum_{\mathbf{u}}\exp\left(-\frac{\beta}{2}\|\mathbf{x}^{\parallel}-\rho^{-1}\mathbf{u}\|^2\right)\right)^2}\, d\mathbf{x}^{\parallel}\,. \qquad (9.16)
\]

The denominator of the integrand in (9.16) is approximated by

\[
\left[\sum_{u}\exp\left(-\frac{\beta}{2}\|x^k - \kappa^{-1}u\|^2\right)\right]^2
\approx \left[\int \exp\left(-\frac{\beta}{2}\|x^k - \kappa^{-1}u\|^2\right) du\right]^2
= \left(\frac{2\pi\kappa^2}{\beta}\right)^{n}\,,
\tag{9.17}
\]

and the numerator of the integrand in (9.16) can be rewritten as

\[
\exp\left(-\frac{\beta}{2}\left(\|x^k - \kappa^{-1}r\|^2 + \|x^k - \kappa^{-1}s\|^2\right)\right)
= \exp\left(-\frac{\beta}{4}\left(\|2x^k - \kappa^{-1}(r+s)\|^2 + \|\kappa^{-1}(r-s)\|^2\right)\right)\,.
\tag{9.18}
\]

Inserting (9.17) and (9.18) into (9.16) and using

\[
\int \exp\left(-\frac{\beta}{4}\,\|2x^k - \kappa^{-1}(r+s)\|^2\right) dx^k = \left(\frac{\pi}{\beta}\right)^{\frac{n}{2}}\,,
\]

one finally obtains the continuum approximation for $f_{rs}$,

\[
f_{rs} \approx f(\|r-s\|) = \sqrt{\left(\frac{\beta}{4\pi\kappa^2}\right)^{n}}\;
\exp\left(-\frac{\beta}{4}\,\kappa^{-2}\,\|r-s\|^2\right)\,.
\tag{9.19}
\]

9.4 Derivation of the Mean-Field Equations

Using the relations

\[
\frac{m_{kr}}{\sum_l \sum_u m_{lu} h_{us}} = \frac{m_{kr}}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{rs}}
\tag{9.20}
\]

and, from $1/(a+b) = 1/a - b/(a(a+b))$,

\[
\frac{1}{\sum_l \sum_u m_{lu} h_{us}}
= \frac{1}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + \sum_u m_{ku} h_{us}}
= \frac{1}{\sum_{l\neq k}\sum_u m_{lu} h_{us}}
- \sum_w h_{ws}\,
\frac{m_{kw}}{\sum_{l\neq k}\sum_u m_{lu} h_{us}\left(\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{ws}\right)}\,,
\tag{9.21}
\]

one obtains for the derivative of the averaged cost function $\langle E_{\mathrm{TMP}}\rangle$

\[
\begin{aligned}
\frac{\partial \langle E_{\mathrm{TMP}}\rangle}{\partial e_{kv}}
&= \frac{1}{2}\sum_{r,s,t} h_{rs} h_{ts} \sum_{i,j} \frac{\partial}{\partial e_{kv}}
\left\langle \frac{m_{ir}\, m_{jt}}{\sum_l \sum_u m_{lu} h_{us}} \right\rangle d_{ij} \\
&= \frac{1}{2}\sum_{r,s,t} h_{rs} h_{ts}
\Bigg[ \sum_{j\neq k} \frac{\partial \langle m_{kr}\rangle}{\partial e_{kv}}
\left\langle \frac{m_{jt}}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{rs}} \right\rangle d_{kj}
+ \sum_{i\neq k} \frac{\partial \langle m_{kt}\rangle}{\partial e_{kv}}
\left\langle \frac{m_{ir}}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{ts}} \right\rangle d_{ik} \\
&\qquad\quad
+ \delta_{rt}\,\frac{\partial \langle m_{kr}\rangle}{\partial e_{kv}}
\left\langle \frac{1}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{rs}} \right\rangle d_{kk}
- \sum_{i\neq k}\sum_{j\neq k}\sum_w \frac{\partial \langle m_{kw}\rangle}{\partial e_{kv}}\, h_{ws}
\left\langle \frac{m_{ir}\, m_{jt}}{\sum_{l\neq k}\sum_u m_{lu} h_{us}\left(\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{ws}\right)} \right\rangle d_{ij}
\Bigg]\,.
\end{aligned}
\tag{9.22}
\]

With $d_{ii} = 0$ and $d_{ij} = d_{ji}$ one obtains

\[
\frac{\partial \langle E_{\mathrm{TMP}}\rangle}{\partial e_{kv}}
= \sum_{r} \frac{\partial \langle m_{kr}\rangle}{\partial e_{kv}} \sum_{s,t} h_{rs} h_{ts}
\sum_{j\neq k} \left\langle \frac{m_{jt}}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{rs}} \right\rangle
\left[ d_{kj} - \sum_{i\neq k}\sum_w h_{ws}
\left\langle \frac{m_{iw}}{\sum_{l\neq k}\sum_u m_{lu} h_{us} + h_{ws}} \right\rangle d_{ij} \right]\,,
\tag{9.23}
\]

and, comparing equations (9.23) and (7.11), one reads off the optimal mean fields $e_{kr}$ of equation (7.12).
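The closed form (9.19) can be checked against the integral expression (9.16) by direct numerical integration. The following minimal sketch does so in one dimension, assuming $\kappa = 1$, $\beta = 1$, and a finite chain of 41 nodes in place of the infinite lattice underlying the continuum approximation; the function names f_numeric and f_closed are ad hoc and not part of the derivation above.

\begin{verbatim}
# Numerical check of (9.19) against (9.16), in one dimension (n = 1).
# Assumed setup: kappa = 1, so the ground-state weights (5.10) sit at
# w_u^0 = u on an integer chain; r and s are chosen near the centre of the
# chain so that boundary effects and the sum-to-integral step (9.17) are
# negligible.
import numpy as np

beta, kappa = 1.0, 1.0
nodes = np.arange(41)            # lattice positions u = 0, ..., 40
w = nodes / kappa                # ground-state weights w_u^0 = u / kappa

def f_numeric(r, s):
    """Evaluate the integral expression (9.16) on a fine grid."""
    x = np.linspace(w[0] - 5.0, w[-1] + 5.0, 50001)
    dx = x[1] - x[0]
    numer = np.exp(-0.5 * beta * ((x - w[r])**2 + (x - w[s])**2))
    denom = np.exp(-0.5 * beta * (x[:, None] - w[None, :])**2).sum(axis=1)**2
    return kappa * np.sum(numer / denom) * dx

def f_closed(r, s):
    """Continuum approximation (9.19) for n = 1."""
    return (np.sqrt(beta / (4.0 * np.pi * kappa**2))
            * np.exp(-0.25 * beta * (r - s)**2 / kappa**2))

r = 20
for s in (20, 21, 22, 23):
    print(abs(r - s), f_numeric(r, s), f_closed(r, s))
\end{verbatim}

For these parameters the two columns should agree to several decimal places, since replacing the lattice sum by an integral in (9.17) is very accurate as long as $\beta$ is not large compared with the squared lattice spacing. Shifting $r$ and $s$ by the same amount, while keeping both away from the chain boundary, leaves the numerical value unchanged, which also illustrates the shift invariance $f_{rs} = f_{\|r-s\|}$ derived in Section 9.2.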
