THE SIMULTANEOUS RECURRENT NEURAL NETWORK
FOR ADDRESSING THE SCALING PROBLEM
IN STATIC OPTIMIZATION
Gursel Serpen, Amol Patwardhan, and Jeff Geib
Authors are with the Electrical Engineering and Computer Science Department of the
University of Toledo, Toledo, OH 43606, USA. Voice (419) 530 8158. Fax (419) 530 8146
and E-mail: [email protected].
Abstract
A trainable recurrent neural network, the Simultaneous Recurrent Neural network, is proposed to
address the scaling problem faced by neural network algorithms in static optimization. Unlike
existing recurrent neural algorithms, which are not trainable, the proposed algorithm derives
its computational power to address the scaling problem from its ability to "learn". The
recurrent backpropagation algorithm is employed to train the recurrent, relaxation-based
neural network in order to associate fixed points of the network dynamics with locally optimal
solutions of the static optimization problems. Performance of the algorithm is tested on the
NP-hard Traveling Salesman Problem in the range of 100 to 600 cities. Simulation results
indicate that the proposed algorithm is able to consistently locate high-quality solutions for
all problem sizes tested. In other words, the proposed algorithm scales demonstrably well with
the problem size with respect to quality of solutions, at the expense of increased
computational cost for large problem sizes.
Keywords: Simultaneous Recurrent Neural Network, Recurrent Backpropagation,
Combinatorial Optimization, Large Scale Problem, Traveling Salesman
Introduction
Higher-level intelligent systems typically possess the ability to perform dynamic
optimization. Another very significant element of a higher-level intelligent system is the
ability to "learn". Intelligent systems that can learn effectively, and thereby acquire higher
levels of intellectual capacity, are more likely to successfully address the challenges
inherent in stochastic real-world environments. Currently, Genetic Algorithms and
heuristic-based search algorithms, neither of which is learning-based, offer the most promising
approaches to address large-scale static optimization problems within reasonable computational
cost limits [Chellapilla and Fogel, 1997; Grotschel and Holland, 1991]. However, their
inability to adapt through learning seriously diminishes their utility for deployment as a
building block for static optimization, to which dynamic optimization problems ultimately lead.
Artificial neural networks (ANN), as static optimizers for large-scale problems and as
systems that can be trained, are real contenders against existing non-learning search
algorithms since ANNs possess the potential to contribute towards solving challenging
dynamic optimization problems. However, a survey of the literature indicates that, among the
significant studies reported, practically all artificial neural network algorithms applied to
static optimization problems were preprogrammed [Smith et al., 1998; Cichocki, 1993]: no
training procedure was applied to adapt the weights of the networks.
The Hopfield network (HN) and its derivatives are perhaps the most widely used ANN
algorithms that address static optimization problems; they topologically belong to the class of
single-layer, relaxation-type recurrent ANNs. The HN derivatives include networks relying on
gain scheduling as in simulated annealing, networks with nodes modeled as lossless
integrators, networks of nodes with unipolar activation functions, networks with additive
uncorrelated noise of zero mean and a variance gradually decreasing in time, the mean-field
theory network, and the mean-field annealing network, among others.
Studies of particularly large-scale static optimization problems using artificial neural
networks, especially the HN and its derivatives, are scarce in the literature [Smith et al.,
1998; Matsuda et al., 1998; Gall et al., 1999]. In the case of the TSP, most reported studies
consider the 100-city problem as the norm to demonstrate the validity of the research
findings. This creates a reasonable degree of uncertainty as to whether the proposed
algorithms, which deliver good performance for 100 or 200 cities, will be able to scale well
for much larger problem sizes. Specifically, keeping the quality of solutions consistent as
the problem size increases is of primary interest and concern when assessing the scalability
property of a given neural optimizer algorithm.
A typical search session for the HN and its derivatives is likely to include numerous
relaxations. A relaxation starts with initialization of the parameters associated with network
dynamics, continues with updates of dynamic system states/outputs, and concludes upon
convergence of network states/outputs to a stable equilibrium point, assuming at least one
exists. After each unsuccessful relaxation, the network is simply reinitialized for the next
relaxation, ignoring the experience associated with the already completed relaxation. The
HN and its derivatives lack a mechanism to incorporate the experience gained following the
previous relaxation cycles. In other words, they do not employ any learning that allows the
neural search algorithm to benefit from the results of prior relaxations, since weights are
preprogrammed and not modified afterwards.
A learning-based recurrent neural search algorithm is expected to offer significant
performance improvements over a non-learning based algorithm. One such neural paradigm,
the Simultaneous Recurrent Neural network (SRN) [Werbos, 1992 and 1994; Pang and
Werbos, 1997] incorporates powerful features: it is a recurrent algorithm with relaxation
search capability, while also being trainable. The Simultaneous Recurrent Neural network
has the potential to develop, through "learning", the ability to address the computationally
challenging task of large-scale static optimization. This forms a very important first step
towards eventually addressing dynamic optimization problems through algorithms that can
formulate learning-based solutions.
This study proposes using a learning-based artificial neural network algorithm, the
Simultaneous Recurrent Neural network, to address large-scale static optimization problems.
Existing neural optimizer algorithms are either trainable feedforward architectures or
recurrent architectures with preprogrammed weight structures. The novel contribution of this
study is the use of a neural network that is both trainable and recurrent to address
large-scale static optimization problems.
Simultaneous Recurrent Network
A Simultaneous Recurrent Network (SRN) is an artificial neural network [Werbos, 1994;
Pang and Werbos, 1997] with the graphical representation as in Figure 1.
Figure 1. Simultaneous Recurrent Network Graphical Representation.
The system has external inputs in the form of a vector x, a feed-forward vector function F
(any feed-forward network, including the multi-layer perceptron, is appropriate), outputs in
the form of a vector z, and a feedback path which copies the outputs to the inputs without a
time delay.
The feed-forward network F will also induce a weight matrix W, which represents the
interconnection topology of the network. The network, starting from an initial state as
indicated by the initial value of the output vector, will iterate until the output vector z
stabilizes and converges to a stable point, given that one exists. In other words, an SRN is
based on a feed-forward network with simultaneous feedback from outputs of the network to
its inputs. An SRN exhibits complex temporal behavior: it follows a trajectory in the state
space to relax to a fixed point. One "relaxation" of the network consists of one or more
iterations of output computation and propagation along the feed-forward and feedback paths
until the outputs converge to a stable equilibrium value.
A more formal description of the SRN is formulated in [Werbos, 1992], who defines an SRN as a
mapping

\hat{z} = F(x, W),  (1a)

where x and W are the external inputs and weights, respectively, and \hat{z} is the
equilibrium value of z, i.e.,

\hat{z} = \lim_{n \to \infty} z(n),  (1b)

which can be computed by the following iteration

z(n+1) = F(z(n), x, W),  (1c)

where F is a feedforward network and n is the iteration index, with very fast computation
cycles compared to the feedback delays found in time-lagged recurrent networks.
The network is provided with the external inputs and initial outputs, which are typically
chosen randomly in the absence of a priori information. The output of the previous iteration
is fed back to the network along with the external inputs to compute the output of the next
iteration. The network is allowed to iterate until it reaches a stable equilibrium point,
assuming at least one exists. External inputs are applied throughout the complete relaxation
cycle. When a stable equilibrium point is reached, the outputs stop changing (i.e., the value
of z(n+1) is equal or very close to z(n)). It is important to note that the feedback from the
output layer to the input layer in the SRN is not delayed: the feedback is, theoretically
speaking, simultaneous.
SRN As A Static Optimizer for the TSP
The Traveling Salesman Problem (TSP) was chosen as the benchmark for the performance
evaluation of the SRN because it is representative of NP-hard optimization problems. In the
TSP, a salesman visits N cities (or nodes) cyclically. In one tour, he visits each city
exactly once and concludes where he started. The goal is to find the order in which he should
visit the cities so as to minimize the total distance traveled. Selection of the TSP as the benchmark
is appropriate because almost all non-learning neural search algorithms fail to deliver
acceptable quality solutions with reasonable computational cost and time for large-scale
variants of this problem.
SRN Topology for the TSP
An N-city TSP is represented by an N×N array, where each row represents a different city
and each column represents the possible positions of the cities in the path. In order to
represent an N-city TSP using the SRN, the feedforward network F in the SRN consists of
two layers. The output layer is an N×N matrix of nodes, with each row representing a city
and each column representing a possible position in the path. Additionally, there is a single
layer of hidden nodes. In the TSP, the inputs to the problem are the distances between the
cities, represented by the cost matrix. As presented in the next section, the cost matrix is
used in the calculation of the error function of the training algorithm. Since the cost matrix is
used as an input to the error function, it does not need to also be included as an external input
to the SRN. Therefore, in the case of the TSP the external inputs x in Figure 1 are not
applied to the network.
The SRN for the TSP consists of a two-layer network with a relatively small number of
hidden nodes and an N×N array of output nodes, with a recurrent connection between the
nodes in the output layer and the hidden layer. Since no external inputs exist, the network is
simply initialized with small random values for the weights and the outputs z and allowed to
relax. Once the network converges to a fixed point, the solution computed by the network is
considered to have materialized. The trajectories the neural network dynamics follow in the
phase space are of no relevance; the fixed points/stable equilibrium points to which the
trajectories lead are the computationally useful entities for the TSP. After the outputs of
the network converge to a fixed point, the outputs can be compared against a problem-specific
error function and the weights modified using a suitable learning algorithm.
Figure 2 shows the architecture of the SRN for the TSP, where the feedforward network F
consists of one hidden layer and one output layer. The SRN has trainable connections from
each neuron of the hidden layer to each neuron of the output layer, which is represented with
the forward weight matrix P. The simultaneous recurrent connections, which are also
trainable, exist from each node in the output layer to each node in the hidden layer, and are
represented by the backward weight matrix V. There are no connections among the nodes in
either the hidden layer or the output layer. All of the connections in the network are
trainable. The outputs z of the network are taken from each output layer neuron. As
discussed above, the external inputs x are not necessary.
Figure 2. SRN Architecture for the TSP.
Neuron dynamics for the output layer of the SRN are represented by

ds_i/dt = -s_i + \sum_{j=1}^{J} p_{ij} y_j  and  z_i = f(s_i)  for i = 1, 2, ..., N×N,  (2a)

where y_j is the output of the j-th neuron in the hidden layer, J is the node count of the
hidden layer, p_{ij} is the forward weight from the j-th neuron in the hidden layer to the
i-th neuron in the output layer, N×N is the dimension of the output array, and f is a
continuous, differentiable function, typically a sigmoid with a steep slope (for combinatorial
optimization problems). Similarly, for a neuron y_j in the hidden layer, the dynamics are
defined by

ds_j/dt = -s_j + \sum_{i=1}^{N×N} v_{ji} z_i  and  y_j = f(s_j)  for j = 1, 2, ..., J,  (2b)

where z_i is the output of the i-th neuron in the output layer, v_{ji} is the backward weight
from the i-th neuron in the output layer to the j-th neuron in the hidden layer, and f is a
continuous and differentiable function, typically a sigmoid with a steep slope.
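The continuous dynamics in (2a) and (2b) can be simulated with a simple forward-Euler
discretization. The following sketch uses our own variable names and step size; it is an
illustration of the update structure, not the authors' code.

```python
import numpy as np

def sigmoid(s, slope=100.0):
    # Steep unipolar sigmoid in [0, 1]; argument clipped to avoid overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(slope * s, -50.0, 50.0)))

def srn_step(s_out, s_hid, P, V, dt=0.01):
    """One forward-Euler step of the dynamics in Eqs. (2a)-(2b).

    s_out: (N*N,) output-layer net inputs; s_hid: (J,) hidden-layer net
    inputs; P: (N*N, J) forward weights p_ij; V: (J, N*N) backward
    weights v_ji. All names are illustrative assumptions.
    """
    y = sigmoid(s_hid)                     # y_j = f(s_j)
    z = sigmoid(s_out)                     # z_i = f(s_i)
    s_out = s_out + dt * (-s_out + P @ y)  # ds_i/dt = -s_i + sum_j p_ij y_j
    s_hid = s_hid + dt * (-s_hid + V @ z)  # ds_j/dt = -s_j + sum_i v_ji z_i
    return s_out, s_hid

# Toy relaxation for a 3-city problem (9 output nodes) with 5 hidden nodes.
rng = np.random.default_rng(0)
P = rng.uniform(-0.2, 0.2, (9, 5))
V = rng.uniform(-0.2, 0.2, (5, 9))
s_out, s_hid = np.zeros(9), np.zeros(5)
for _ in range(1000):
    s_out, s_hid = srn_step(s_out, s_hid, P, V)
```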
Training the SRN with the RBP
In order to train the SRN, it is necessary to define a measure of the error for the output
neurons. The error function needs to ensure valid solutions, as well as a minimum path
length. Certain constraints need to be in place to ensure a valid solution. Given the problem
representation presented above, each row and column in the N×N output array must have
exactly one neuron active: the output value of an active neuron should be close to 1.0 while
the output of each inactive neuron in the N×N array must approach the limiting value of 0.0.
The traveling salesman must visit each city exactly once. Thus, when the network converges to
a solution, there should be exactly one active node per row and per column. This constraint
can be implemented using inhibition among the nodes in a given row and column. The error term
for the column constraint is defined by
E_{col} = g^{col} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{m=1}^{N} z_{mj}(\infty) \right]^2,  (3a)

where i and j are the indices for rows and columns, respectively, m is the index for rows of
the network, z_{mj}(\infty) is the stable value of the mj-th neuron output upon convergence to
a fixed point, and g^{col} is a positive real weight parameter. When each column of the output
matrix has exactly one active neuron, this error term will be zero. The first summation over
the indexing variable i is included because the error function needs to be defined for each
neuron in the output layer.
Similarly, the error term for the row constraint is given by
E_{row} = g^{row} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{n=1}^{N} z_{in}(\infty) \right]^2,  (3b)

where i and j are the indices for rows and columns of the network, respectively, n is the
index for columns, and g^{row} is a positive real weight parameter. This error term will have
a value of zero when each row of the output matrix has exactly one active neuron. Again, the
second summation over the index variable j is included since the error function needs to be
defined for every ij-th neuron in the output layer.
An error term is also introduced that forces the neuron outputs toward the limiting values of
0.0 or 1.0:

E_{bin} = g^{bin} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ -\left( z_{ij}(\infty) - \alpha \right)^2 + \beta \right],  (3c)

where α and β are constants and g^{bin} is the positive real weight parameter for this
constraint. Each term is a downward-opening quadratic function. By choosing values of α = 0.5
and β = 0.25, the zeros of the function are at 0.0 and 1.0. Thus, this error term has a
minimum value of zero when all of the neurons in the output layer have their output values at
either 0.0 or 1.0.
The error term associated with the distance between the cities can be formulated as

E_{dis} = g^{dis} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{m=1}^{N} z_{ij}(\infty)\, z_{m(j+1)}(\infty)\, d_{im},  (3d)
where d_{im} is the cost associated with the path from city i to city m and g^{dis} is the
positive real weight parameter for this constraint. For each neuron z_{ij}, the index m runs
over each neuron in the (j+1)-st column, indicated by the z_{m(j+1)} term. If both neurons are
active, the distance from city i to city m, d_{im}, is included in this error term; the
minimum value is achieved when the total distance of the path is minimum.
The total error function E is the sum of the individual error terms:

E = E_{col} + E_{row} + E_{bin} + E_{dis}.  (3e)
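The four error terms of (3a) through (3e) can be written compactly with array operations. The
sketch below is our reading of the definitions: the factor of N on the constraint terms (from
the outer double sums) and the cyclic wrap of column j+1 are assumptions, as are all variable
names.

```python
import numpy as np

def tsp_error(Z, D, g_col=0.003, g_row=0.003, g_bin=0.003, g_dis=0.01,
              alpha=0.5, beta=0.25):
    """Total error E = E_col + E_row + E_bin + E_dis, Eqs. (3a)-(3e).

    Z is the N x N matrix of stable outputs z_ij(inf); D is the cost
    matrix d_im.
    """
    N = Z.shape[0]
    # (3a)/(3b): the outer double sum repeats each bracket N times,
    # hence the factor of N.
    E_col = g_col * N * np.sum((1.0 - Z.sum(axis=0)) ** 2)
    E_row = g_row * N * np.sum((1.0 - Z.sum(axis=1)) ** 2)
    # (3c): zero exactly when every output sits at 0.0 or 1.0.
    E_bin = g_bin * np.sum(-(Z - alpha) ** 2 + beta)
    # (3d): cost of consecutive columns; roll makes column j+1 cyclic.
    Z_next = np.roll(Z, -1, axis=1)
    E_dis = g_dis * np.einsum('ij,mj,im->', Z, Z_next, D)
    return E_col + E_row + E_bin + E_dis

# A valid permutation matrix zeroes the constraint terms, leaving only
# the weighted tour length.
E = tsp_error(np.eye(3), np.ones((3, 3)))
```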
In order to converge on a valid and good (with minimum total distance) solution for the TSP,
the state space portrait of the SRN must be changed by moving the fixed points towards
preferably good solutions of the TSP. This is accomplished by altering the weights of the
network using a training algorithm. To train the SRN, recurrent backpropagation (RBP), a
variant of the traditional backpropagation algorithm, is used. This is a gradient descent
learning method for recurrent neural networks, which reshapes the state space portrait of the
network based on a defined error measure.
The full derivation of the RBP algorithm can be found in [Pineda, 1987; Werbos, 1988]. The
RBP training algorithm requires an adjoint network, topologically identical to the SRN except
with all signal directions reversed, to be set up and relaxed in order to compute updates for
the weights of the SRN. The adjoint network accepts the error, which is computed using the
stable values of the neurons in the output layer of the SRN upon convergence to a fixed point,
as external input to the neurons in its input layer, which corresponds to the output layer of
the SRN.
The RBP training algorithm for the SRN is implemented as follows. Upon convergence of
the SRN dynamics to a fixed point, error values for output nodes need to be computed. The
error for an output node is computed by the following formula:
e_i = \tau_i - z_i(\infty),  (4)

where z_i(\infty) is the stable output value of the i-th neuron in the output layer upon
convergence to a fixed point, with i = 1, 2, ..., N×N, and \tau_i is the desired value of the
i-th neuron output.
Next, an adjoint network is set up. Its topology is identical to that of the SRN with all
signal directions reversed, i.e., the ij-th element of the forward/backward weight matrix of
the adjoint network is equal to the ji-th element of the forward/backward weight matrix of the
SRN, respectively, and the output/input layers are relabeled as input/output, i.e.,
y^* \leftarrow z and z^* \leftarrow y, where starred vectors/matrices are associated with the
adjoint network. The adjoint network has the following linear dynamics:

dz^*_j/dt = -z^*_j + \sum_{i=1}^{N×N} p^*_{ji} y^*_i  for j = 1, 2, ..., J  and

dy^*_i/dt = -y^*_i + \sum_{j=1}^{J} v^*_{ij} z^*_j + e_i  for i = 1, 2, ..., N×N,  (5)

for the neurons in the output layer (a 1×J array) and input layer (an N×N array) of the
adjoint network, respectively, while noting that p^*_{ji} = p_{ij} and v^*_{ij} = v_{ji}.
Following the derivation in the Appendix, the error term for each neuron in the output layer
is computed by

e_{qr} = 2 g^{col} \left[ \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right] + 2 g^{row} \left[ \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right] - 2 g^{bin} \left( z_{qr}(\infty) - \alpha \right) + g^{dis} \sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty),  (6)

where q, r = 1, 2, ..., N and i, the index for the output layer neurons, is related to the row
index q and column index r by i = (q - 1)N + r.
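Equation (6) can be evaluated for all output neurons at once. The sketch below is a vectorized
reading of its four terms; the cyclic treatment of column r+1 and all names are our
assumptions.

```python
import numpy as np

def output_errors(Z, D, g_col, g_row, g_bin, g_dis, alpha=0.5):
    """Vectorized Eq. (6): error injected at each output neuron (q, r).

    Z: N x N stable outputs z_qr(inf); D: cost matrix d_qm.
    """
    col = 2.0 * g_col * (Z.sum(axis=0) - 1.0)   # column term, one value per r
    row = 2.0 * g_row * (Z.sum(axis=1) - 1.0)   # row term, one value per q
    e = col[None, :] + row[:, None] - 2.0 * g_bin * (Z - alpha)
    e += g_dis * (D @ np.roll(Z, -1, axis=1))   # sum_m d_qm z_m(r+1), cyclic
    return e
```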
Noting that local stability of the SRN dynamics is a sufficient condition for the convergence
of the adjoint network dynamics [Pineda, 1987; Almeida, 1987], once the adjoint network
converges, the weight updates can be computed as

\Delta p_{ij} = -\eta \, \partial E / \partial p_{ij} = \eta \, f'(s_i(\infty)) \, y^*_i(\infty) \, y_j(\infty)  and

\Delta v_{ji} = -\eta \, \partial E / \partial v_{ji} = \eta \, f'(s_j(\infty)) \, z^*_j(\infty) \, z_i(\infty)

for the forward and backward weight matrix entries, respectively, where η is the learning rate
and f' is the derivative of the function f.
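One relaxation of the adjoint network followed by the weight updates can be sketched as below,
again with a forward-Euler discretization and our own naming conventions; it is an
illustration of the update structure, not the authors' implementation.

```python
import numpy as np

def f(s, slope=100.0):
    # Steep unipolar sigmoid; argument clipped to avoid overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(slope * s, -50.0, 50.0)))

def fprime(s, slope=100.0):
    v = f(s, slope)
    return slope * v * (1.0 - v)

def rbp_update(P, V, s_out, s_hid, e, eta=0.01, dt=0.01, n_steps=2000):
    """Relax the adjoint network of Eq. (5), then apply weight updates.

    P: (M, J) forward weights (M = N*N); V: (J, M) backward weights;
    s_out, s_hid: converged SRN net inputs; e: (M,) output errors.
    """
    M, J = P.shape
    y = f(s_hid)                 # hidden outputs y_j(inf)
    z = f(s_out)                 # output-layer outputs z_i(inf)
    z_adj = np.zeros(J)          # adjoint output layer z*
    y_adj = np.zeros(M)          # adjoint input layer y*
    for _ in range(n_steps):
        # Eq. (5) with p*_ji = p_ij and v*_ij = v_ji:
        z_adj = z_adj + dt * (-z_adj + P.T @ y_adj)
        y_adj = y_adj + dt * (-y_adj + V.T @ z_adj + e)
    # Weight updates at the adjoint fixed point:
    dP = eta * np.outer(fprime(s_out) * y_adj, y)  # Δp_ij = η f'(s_i) y*_i y_j
    dV = eta * np.outer(fprime(s_hid) * z_adj, z)  # Δv_ji = η f'(s_j) z*_j z_i
    return P + dP, V + dV
```

With zero injected error the adjoint network stays at rest and the weights are unchanged,
which is a quick sanity check of the update rule.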
Parameter Definitions for SRN and RBP
Many different variables exist in the SRN and the RBP training algorithm that can affect the
ability of the network to converge, the speed of finding a solution, the ability to find a
solution, and the quality of the solution to the TSP, among others. Within the architecture of
the SRN itself, several choices have to be made. The structure of the network F can be any
feedforward network. In this case, a multi-layer perceptron (MLP) with one hidden layer and
output layer, which corresponds to a minimal topology, was chosen. For both layers of the
SRN, neuron activation functions were modeled with unipolar continuous sigmoid in the
range of [0.0,1.0] with a relatively large steepness value of 100.
Another empirically determined parameter is the number of nodes in the hidden layer, which
can have a drastic effect on the speed of convergence of the network. A twofold increase in
the number of hidden nodes doubles the number of weights from the hidden layer to the output
layer and from the output layer to the hidden layer, which also doubles the memory
requirements. This in turn increases the number of calculations required, leading to longer
relaxation times and a larger number of relaxations to locate a solution. The choice of five
hidden layer nodes provided reasonable relaxation times and counts to compute a solution while
keeping the memory requirements manageable.
The weights and outputs of the SRN also need to be initialized. Small random numbers,
uniformly distributed in the interval [–0.2,0.2], were used to initialize the two weight
matrices, P and V. The outputs of the SRN were initialized to uniformly distributed random
values in the interval [0.0,1.0]. Once the training began, the outputs were not re-initialized
subsequent to each relaxation during the training: simply the previous outputs were used.
The error function includes several weight parameters that affect the scaling of each
individual error term. It was empirically determined that the precise values of these
parameters had little effect on the network, but the ratio between the parameters did. For
example, the normalized row and column error terms were roughly 10 to 50 times larger than
the distance error term at the start of training. In order to keep the row and column terms
from monopolizing the total error, the distance weight parameter was initialized to be 10
times larger than other weight parameters.
Additionally, the weight parameters need to be incremented during the training process. If
the parameters are not incremented, the training algorithm fails to find a solution: after a
number of iterations, training ceases to provide improved results. By incrementing the weight
parameters every so many iterations, the algorithm does eventually converge to a solution.
Incrementing the weight parameters introduces two additional variables to the problem: the
amount of the increment and the frequency of incrementing. Incrementing the parameters by
small values, on the order of 0.004, provided good results, and incrementing the parameters
every five relaxations provided the quickest convergence towards a solution. Incrementing the
parameters more frequently sped up the algorithm, but five relaxations was found to be the
lower limit on this variable before the algorithm again failed to converge.
The values of the parameters can have a drastic effect on the quality of the solution and the
time it takes to converge. If too much emphasis is placed on the row and column error terms,
a solution will be found quickly, but the path length will only be average. If too much
emphasis is placed on the distance term, the path distance may become very small, but a valid
path that satisfies the row and column constraints will not be found even after very many
iterations. The values in Table 1 provided good results for problems of all sizes tested.
         Initial Value   Increment
g^col    0.003           0.004
g^row    0.003           0.004
g^bin    0.003           0.004
g^dis    0.010           0.020

Table 1. Error Function Parameter Values and Increments.
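The increment schedule described above, with the values of Table 1, can be captured in a few
lines; the helper name and dictionary layout are our own.

```python
# Constraint-weight schedule from Table 1: start at the initial values
# and add the increment once every five relaxations.
initial = {"g_col": 0.003, "g_row": 0.003, "g_bin": 0.003, "g_dis": 0.010}
increment = {"g_col": 0.004, "g_row": 0.004, "g_bin": 0.004, "g_dis": 0.020}

def weights_at(relaxation, every=5):
    """Constraint weights in effect after `relaxation` completed relaxations."""
    bumps = relaxation // every
    return {k: initial[k] + bumps * increment[k] for k in initial}
```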
When to stop training the network is another important question. Since the best solution is
not known, it is not possible to stop training when the best solution is reached. One
measure, and the only criterion practically available, is to stop training when a valid
solution is reached, even though it may not be the best or even a good solution. It is
important to keep in mind that further training past this point could achieve a better
solution, depending in particular on the weight parameter values.
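The validity criterion, exactly one active neuron per row and per column, is straightforward
to check once the network has relaxed. The sketch below also decodes the tour and computes its
length; the 0.5 activation threshold and all names are our assumptions.

```python
import numpy as np

def decode_tour(Z, threshold=0.5):
    """Validity test and tour extraction for a converged N x N output.

    A solution is valid when each row and each column contains exactly
    one active (near-1.0) neuron. Returns the city visited at each
    position, or None if the matrix does not encode a valid tour.
    """
    A = Z > threshold
    if not ((A.sum(axis=0) == 1).all() and (A.sum(axis=1) == 1).all()):
        return None
    return [int(np.argmax(A[:, j])) for j in range(A.shape[1])]

def tour_length(tour, D):
    # Total cyclic path length under cost matrix D.
    n = len(tour)
    return sum(D[tour[j], tour[(j + 1) % n]] for j in range(n))
```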
Simulation Study Results
Simulations were performed for problem sizes of 100 to 600 cities in increments of 100
cities. The distances between the cities, i.e., the cost matrix, were initialized with
uniformly random numbers in the interval [0.0,1.0]. It is noted that a random choice of a path
should result in an average distance of 0.5 between any two neighboring cities. The diagonal
elements of the cost matrix were set to 1.0, the maximum distance, to prevent looping from a
city back to itself. The simulations were performed on a Sun Ultra 10 Workstation with dual
300 MHz CPUs and 1.2 GB RAM, running SunOS 5.7.
Table 2 below shows the simulation results. To determine the computational time, the UNIX
command timex was used. From this, the total user time and the total system time were
added together. The sum of these values gives the total CPU time used by the process, which
is invariant to other processes running in the background.
Number      Average Distance   Average Number    Average Computation
of Cities   Between Cities     of Relaxations    Time (minutes)
100         0.24               650               10
200         0.30               1700              135
300         0.26               3000              634
400         0.27               3200              1070
500         0.26               3500              1550
600         0.29               4010              2409

Table 2. SRN Simulation Results for the TSP.

There are two important observations that can be made based on the data in Table 2: the
quality of solutions found for any problem size is much better than the expected value of a
randomly chosen solution, and the quality of solutions does not deteriorate as the problem
size is increased from 100 to 600 cities. The significant implication is that the SRN is a
"good" neural optimizer for its ability to compute high-quality solutions, and that the SRN
scales up with increases in the problem size in terms of its ability to deliver consistently
high-quality solutions.
Comparative Performance Assessment
The TSP has been extensively studied in the combinatorial optimization field as a benchmark
problem. Non-neural search algorithms employing heuristics report success on large-scale
problems in terms of both the quality of solutions and computational efficiency [Shutler,
2001; Grotschel and Holland, 1991]. Genetic Algorithms (GA) are among the most computationally
efficient and effective (in their promise to locate nearly global optima) solution algorithms
for the TSP [Chellapilla and Fogel, 1997]. Concurrent efforts to solve the TSP using neural
algorithms employed Hopfield recurrent networks and their stochastic derivatives, i.e.,
mean-field annealing and the Boltzmann Machine, as well as derivatives of self-organizing
neural networks [Gee and Prager, 1995]. One important common feature, or shortcoming, of all
these algorithms is that they are non-learning: they do not have the ability to improve their
performance based on experience. Furthermore, heuristics-based search algorithms are not
generalized search algorithms, since heuristics are often problem-specific for maximum
utility. Similar observations apply to self-organizing neural networks, since these
algorithms are highly specialized for a given problem without much potential for general
applicability to a class of problems.
The Hopfield network and its derivatives are frequently used to address optimization
problems. The Hopfield network is a single layer, fully connected relaxation-type network.
In previous studies [Serpen et al., 2000; Serpen et al., 1997; Gee and Prager, 1995; Smith et
al., 1998], its application to even small-scale TSPs produced mostly average-quality
solutions. As the problem size increased to even 100 cities, the solution quality tended
toward the average more markedly, indicating the inability of the Hopfield network to scale
to even moderately large problems.
The Boltzmann Machine is a version of the Hopfield network with a stochastic search
component. The Boltzmann Machine suffers from excessive memory requirements for large-scale
problems. For an N-city problem, the network requires N^2 neurons and N^4 weights, since each
neuron is connected to every other neuron. For a 1,000-city problem, this is 1,000,000
neurons and 10^12 weights. This number of weights requires prohibitively large memory
storage, in addition to tremendous computational power, to allow the network to relax
following a sequential annealing schedule and to calculate the weight updates in a reasonable
amount of time, unless hardware realization of the algorithm becomes feasible. Given the
memory and computational time requirements of the Boltzmann Machine, simulation of the
algorithm is not practical for empirically assessing its true computational promise for
large-scale problems, as the scarcity of such simulations reported in the published
literature also indicates [Gee and Prager, 1995; Smith et al., 1998].
Evolutionary computing is most likely the foremost field for computing the global optimum of
a function [Werbos, 1999]. In order to facilitate a performance comparison between the SRN
and the Genetic Algorithm, the GA was applied to the same TSP set [Geib, 2000]. The software
used was the GAlib Genetic Algorithm package, version 2.4.5 [Wall, 2001], which was
instantiated to implement a steady-state genetic algorithm with 1% of the population replaced
each generation. An ordered list of cities was used as the genome, and the genetic operator
was an edge recombination crossover operator. Partial match crossover was also tried, but
performed poorly. The population size was specified as 100. The population was allowed to
evolve until the best solutions from two consecutive populations were within a specified
error tolerance of 0.01. Overall, the GA was able to compute better-quality solutions than
the SRN/RBP algorithm at much lower computational cost for up to 600 cities for the TSP
[Geib, 2000]: the GA was able to locate solutions with normalized total distance in the range
[0.07,0.15] for the same TSP instances. Although the GA performed well for the relatively
large-scale TSPs, a recent conjecture [Werbos, 1999] suggests that the GA is not likely to
handle problems with a very large number of variables due to its lack of ability to adapt
through learning.
Conclusions
The Simultaneous Recurrent Neural network with Recurrent Backpropagation training
algorithm was able to find “good quality” solutions for large-scale Traveling Salesman
Problem in the range of 100 to 600 cities. These “good quality” solutions for large-scale
variants of the problem were obtained through increased computational effort. It is
significant to note that SRN was able to locate a good quality solution after every attempt.
The computational cost required to employ the SRN as static optimizer algorithm appears to
be relatively high. The initial and incremental values of the constraint weight parameters
need to be determined heuristically and play very important role for properly guiding the
training of the network. However, it was not very difficult to find initial values and the
Page 21 of 27
values for increments of these constraint weight parameters. The average normalized distance
between cities along the travel path computed by the SRN was typically in the range of 0.25 to
0.35, compared with an expected value of 0.50 for a randomly chosen pair of cities, and
remained in the same interval as the problem size was varied from 100 to 600 cities. The
Simultaneous Recurrent Neural network scaled well with the increase in problem size at the
expense of increased computational cost, which can potentially be overcome if and when a
hardware realization of the algorithm becomes feasible.
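The 0.50 benchmark quoted above can be checked numerically. The sketch below is not part of the original study; it assumes cities drawn uniformly at random from the unit square, for which the mean Euclidean distance between two random cities is known to be about 0.52, close to the paper's stated expected value.

```python
import random

def mean_random_leg_length(samples=100_000, seed=1):
    """Monte Carlo estimate of the expected distance between two
    cities placed uniformly at random in the unit square."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x1, y1, x2, y2 = (rng.random() for _ in range(4))
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total / samples
```

Against this baseline of roughly 0.52 per normalized leg, the SRN's 0.25 to 0.35 range corresponds to tours substantially shorter than a random permutation at every problem size tested.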
This simulation-based study further indicated that the SRN trained with RBP as a static
optimizer is a robust algorithm with respect to stability. For randomly specified initial
weight matrix values and node outputs, the neural network dynamics converged to a stable
point on every attempt, across a large variation in constraint weight parameter values and
problem sizes. The algorithm thus demonstrated stability, in the sense of convergence to a
fixed point following relaxation, throughout the scope of the experiments performed. This
empirically observed feature might be a precursor to the existence of a Liapunov function
for the SRN/RBP in a large subspace of the high-dimensional parameter/weight space. It is
also reasonable to expect that incorporating a stochastic element into the search process
implemented by the SRN would improve the quality of solutions, possibly at the expense of
increased computational cost.
Further research efforts will concentrate on developing mathematical insight into the
dynamic system properties of the SRN, including initialization of parameters and weights,
the conditions for the existence of stable equilibrium points, and limitations and bounds on weight
update formulae. A more computationally efficient implementation of the training algorithm
and testing of the proposed algorithm on larger problem sizes (over 1000 cities) will also be
pursued.
REFERENCES
L. B. Almeida, “A Learning Rule for Asynchronous Perceptrons with Feedback in a
Combinatorial Environment,” Proceedings of IEEE ICNN, Vol. 2, pp. 609-618, 1987.
K. Chellapilla and D. B. Fogel, "Exploring Self-Adaptive Methods to Improve the Efficiency
of Generating Approximate Solutions to Traveling Salesman Problems Using
Evolutionary Programming," Evolutionary Programming, Vol. VI, pp. 361-371,
Springer-Verlag, Berlin, 1997.
A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing,
Wiley, New York, 1993.
P. Baldi, "Gradient Descent Learning Algorithm Overview: A General Dynamical Systems
Perspective," IEEE Transactions on Neural Networks, Vol. 6, No. 1, pp. 182-195, 1995.
A. Le Gall and V. Zissimopoulos, "Extended Hopfield Models for Combinatorial
Optimization", IEEE Transactions on Neural Networks, Vol. 10, No. 1, pp. 72-80,
January 1999.
A. H. Gee and R. W. Prager, "Limitations of Neural Networks for Solving Traveling
Salesman Problems," IEEE Transactions on Neural Networks, Vol. 6, No. 1, January
1995.
J. Geib, "The Simultaneous Recurrent Network Applied to Large-Scale Traveling Salesman
Problems," A Technical Report to Electrical Engineering and Computer Science
Department, The University of Toledo, August 2000.
M. Grötschel and O. Holland, "Solution of Large-Scale Symmetric Traveling Salesman
Problems," Mathematical Programming, Vol. 51:2, pp. 141-202, September 1991.
M. Held and R. M. Karp, “The Traveling-Salesman Problem and Minimum Spanning Trees:
Part II,” Mathematical Programming, Vol. 1, pp. 6-25, 1971.
M. Held and R. M. Karp, “The Traveling-Salesman Problem and Minimum Spanning Trees,”
Operations Research, Vol. 18, pp. 1138-1162, 1970.
J. J. Hopfield and D. W. Tank, "Computing with Neural Circuits: A Model," Science, Vol.
233, pp. 625-633, 1986.
J. J. Hopfield and D. W. Tank, "Neural Computation of Decisions in Optimization Problems,"
Biological Cybernetics, Vol. 52, pp. 141-152, 1985.
D. S. Johnson, L. A. McGeoch, and E. E. Rothberg, “Asymptotic Experimental Analysis for
the Held-Karp Traveling Salesman Bound,” Proceedings of the Seventh Annual ACM-
SIAM symposium on Discrete Algorithms, pp. 341-350, 1996.
S. Matsuda, "Optimal Hopfield Network for Combinatorial Optimization with Linear Cost
Function", IEEE Transactions on Neural Networks, Vol. 9, No. 6, pp. 1319-1330,
November 1998.
X. Pang and P. J. Werbos, “Neural Network Design for J Function Approximation in
Dynamic Programming,” Unpublished research article.
B. A. Pearlmutter, "Gradient Calculations for Dynamic Recurrent Neural Networks: A
Survey," IEEE Transactions on Neural Networks, Vol. 6, No. 5, pp. 1212-1228, 1995.
F. J. Pineda, "Generalization of Back-Propagation to Recurrent Neural Networks," Physical
Review Letters, Vol. 59, No. 19, pp. 2229-2232, 1987.
G. Serpen and D. L. Livingston, "Determination of Weights for Relaxation Recurrent Neural
Networks," Neurocomputing, Vol. 34, No. 1-4, pp. 145-168, September 2000.
G. Serpen and A. Parvin, “On the Performance of Hopfield Network for Graph Search
Problem,” Neurocomputing: An International Journal, Vol. 14, pp. 365-381, 1997.
G. Serpen, D. L. Livingston, and A. Parvin, “Determination of Parameters in Relaxation-
Search Neural Networks for Optimization Problems,” Proceedings of International
Conference on Neural Networks, Houston, TX, Vol. 2, pp. 1125-1129, 1997.
K. Smith, M. Palaniswami and M. Krishnamoorthy, "Neural Techniques for Combinatorial
Optimization with Applications", IEEE Transactions on Neural Networks, Vol. 9, No. 6,
pp. 1301-1318, November 1998.
P. M. E. Shutler, "An Improved Branching Rule for the Symmetric Traveling Salesman
Problem," Journal of the Operational Research Society, Vol. 52:2, pp. 169--175,
February 2001.
M. Wall, GAlib: A C++ Genetic Algorithm Library, http://lancet.mit.edu/ga/, 2001.
P. J. Werbos, "Brain-like Stochastic Search: A Research Challenge and Funding
Opportunity," unpublished article, 1999.
P. J. Werbos, “Optimization Methods for Brain-like Intelligent Control,” IEEE Conference
on Decision and Control, pp. 579-584, 1995.
P. J. Werbos, “The Brain as a Neurocontroller: New Hypothesis and New Experimental
Possibilities”, in K. Pribram, Ed., Origins: Brain and Self-organization, Hillsdale NJ:
Erlbaum, pp. 680-706, 1994.
P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do It,"
Proceedings of the IEEE, Vol. 78, No. 10, pp. 1550-1560, 1990.
P. J. Werbos, "Generalization of Backpropagation with Application to A Recurrent Gas
Market Model." Neural Networks, Vol. 1, No. 4, pp. 234-242, 1988.
C. L. Valenzuela and A. J. Jones, “Estimating the Held-Karp lower bound for the geometric
TSP,” European Journal of Operational Research, Vol. 102, pp. 157-175, 1997.
APPENDIX
COMPUTATION OF ERROR FUNCTION FOR
RECURRENT BACKPROPAGATION
The error function E is defined in terms of the error value for each individual neuron, $e_i$, in
the output layer by

$$ E = \frac{1}{2}\sum_{i=1}^{N \times N} e_i^2 , $$

where the dimension of the output node array is N×N and the error is computed as in Equation
4. The derivative of the error function E with respect to some weight $w_{kl}$, where $w_{kl}$ is the weight
between the k-th node in the output layer for k = 1, 2, …, N×N and the l-th node in the hidden layer for
l = 1, 2, …, J, can conveniently be defined in terms of the error value for each individual output
neuron, $e_i$, by

$$ \frac{\partial E}{\partial w_{kl}} = -\sum_{i=1}^{N \times N} e_i \, \frac{\partial z_i(\infty)}{\partial w_{kl}} . $$

This equation can be rewritten in terms of an output node array with N rows and N columns,
where i, the index of the output layer neurons, is related to the row and column indices q and r
by i = (q−1)N + r:

$$ \frac{\partial E}{\partial w_{kl}} = -\sum_{q=1}^{N}\sum_{r=1}^{N} e_{qr} \, \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-1} $$
Note that, from Equation 3e, we have

$$ \frac{\partial E}{\partial w_{kl}} = \frac{\partial}{\partial w_{kl}}\left(E_{col} + E_{row} + E_{dis} + E_{bin}\right) = \frac{\partial E_{col}}{\partial w_{kl}} + \frac{\partial E_{row}}{\partial w_{kl}} + \frac{\partial E_{dis}}{\partial w_{kl}} + \frac{\partial E_{bin}}{\partial w_{kl}} . \tag{A-2} $$
Thus, the derivative of total error function can be computed by adding the derivatives of
individual error terms.
Using the error term due to the row constraint in Equation 3b and taking the derivative with
respect to $w_{kl}$ yields

$$ \frac{\partial E_{row}}{\partial w_{kl}} = -2 g_{row} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(1 - \sum_{n=1}^{N} z_{qn}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} , $$

while noting that $\partial z_{qn}(\infty)/\partial w_{kl} \neq 0$ only when n = r and k = (q−1)N + r.
The desirable form for this error term is then given by

$$ \frac{\partial E_{row}}{\partial w_{kl}} = 2 g_{row} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{n=1}^{N} z_{qn}(\infty) - 1\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-3} $$
Similarly, using the error term in Equation 3a for the column constraint and taking the
derivative with respect to $w_{kl}$ results in

$$ \frac{\partial E_{col}}{\partial w_{kl}} = -2 g_{col} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(1 - \sum_{m=1}^{N} z_{mr}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} , $$

since $\partial z_{mr}(\infty)/\partial w_{kl} \neq 0$ only when m = q and k = (q−1)N + r.
Further simplification of the above equation yields

$$ \frac{\partial E_{col}}{\partial w_{kl}} = 2 g_{col} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{m=1}^{N} z_{mr}(\infty) - 1\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-4} $$
The derivative of the error term due to the distance/cost constraint in Equation 3d with
respect to $w_{kl}$ is given by

$$ \frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N}\sum_{r=1}^{N}\sum_{m=1}^{N} d_{qm} \left( z_{qr}(\infty)\,\frac{\partial z_{m(r+1)}(\infty)}{\partial w_{kl}} + z_{m(r+1)}(\infty)\,\frac{\partial z_{qr}(\infty)}{\partial w_{kl}} \right) . $$

The partial derivative $\partial z_{qr}(\infty)/\partial w_{kl}$ vanishes except for the weight from the l-th neuron in the hidden layer
to the k-th neuron in the output layer, where k = (q−1)N + r. Therefore, the first partial derivative
term inside the parentheses is always zero, and the error term for the distance constraint
simplifies to

$$ \frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-5} $$
The derivative of the error term for the neuron output constraint in Equation 3c can be computed
as

$$ \frac{\partial E_{bin}}{\partial w_{kl}} = -2 g_{bin} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\alpha - z_{qr}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-6} $$
By substituting the terms in Equations A-3 through A-6 into Equation A-2, the entire partial
derivative becomes

$$ \frac{\partial E}{\partial w_{kl}} = \sum_{q=1}^{N}\sum_{r=1}^{N} \left[ 2 g_{col}\left(\sum_{m=1}^{N} z_{mr}(\infty) - 1\right) + 2 g_{row}\left(\sum_{n=1}^{N} z_{qn}(\infty) - 1\right) - 2 g_{bin}\left(\alpha - z_{qr}(\infty)\right) + g_{dis}\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty) \right] \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-7} $$
Associating Equation A-7 with Equation A-1 yields

$$ e_{qr} = 2 g_{col}\left(1 - \sum_{m=1}^{N} z_{mr}(\infty)\right) + 2 g_{row}\left(1 - \sum_{n=1}^{N} z_{qn}(\infty)\right) + 2 g_{bin}\left(\alpha - z_{qr}(\infty)\right) - g_{dis}\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty) $$

for the error term for the neuron in the q-th row and r-th column of the two-dimensional output
array. This error term can readily be employed in the weight update formula given by
Equation 5 for the Recurrent Backpropagation algorithm to train the Simultaneous Recurrent
Neural Network.
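The per-neuron error derived above is straightforward to compute in vectorized form. The sketch below is an assumed NumPy transcription, not code from the original study; the symbols g_row, g_col, g_dis, g_bin, alpha, the relaxed outputs z(∞), and the distance matrix d follow the appendix notation, with z[q, r] read as city q at tour position r, and the tour-position index r+1 is assumed to wrap cyclically.

```python
import numpy as np

def output_error(z, d, g_row, g_col, g_dis, g_bin, alpha):
    """Error e_qr for each output neuron of the N x N array.

    z : (N, N) relaxed output array z(infinity)
    d : (N, N) inter-city distance matrix
    Returns the (N, N) array of e_qr values.
    """
    row_term = 2.0 * g_row * (z.sum(axis=1, keepdims=True) - 1.0)  # (N, 1): row sums vs. 1
    col_term = 2.0 * g_col * (z.sum(axis=0, keepdims=True) - 1.0)  # (1, N): column sums vs. 1
    bin_term = -2.0 * g_bin * (alpha - z)                          # neuron-output constraint
    # sum_m d[q, m] * z[m, r+1], with the position index wrapping cyclically
    dis_term = g_dis * (d @ np.roll(z, -1, axis=1))
    # e_qr is the negative of the bracketed sum, matching Equation A-1
    return -(row_term + col_term + bin_term + dis_term)
```

In a training loop this array would supply the error values consumed by the Recurrent Backpropagation weight update of Equation 5.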