THE SIMULTANEOUS RECURRENT NEURAL NETWORK
FOR ADDRESSING THE SCALING PROBLEM
IN STATIC OPTIMIZATION
Gursel Serpen, Amol Patwardhan, and Jeff Geib
Authors are with the Electrical Engineering and Computer Science Department of the
University of Toledo, Toledo, OH 43606, USA. Voice (419) 530 8158. Fax (419) 530 8146
and E-mail: [email protected].
Abstract
A trainable recurrent neural network, the Simultaneous Recurrent Neural network, is proposed to
address the scaling problem faced by neural network algorithms in static optimization. Unlike
existing recurrent neural algorithms, which are not trainable, the proposed algorithm derives
its computational power to address the scaling problem from its ability to "learn". The
recurrent backpropagation algorithm is employed to train the recurrent, relaxation-based
neural network in order to associate fixed points of the network dynamics with locally optimal
solutions of the static optimization problems. Performance of the algorithm is tested on the
NP-hard Traveling Salesman Problem in the range of 100 to 600 cities. Simulation results
indicate that the proposed algorithm is able to consistently locate high-quality solutions for
all problem sizes tested. In other words, the proposed algorithm scales demonstrably well with
the problem size with respect to quality of solutions, at the expense of increased
computational cost for large problem sizes.
Keywords: Simultaneous Recurrent Neural Network, Recurrent Backpropagation,
Combinatorial Optimization, Large Scale Problem, Traveling Salesman
Introduction
Higher-level intelligent systems typically possess the ability to perform dynamic
optimization. Another very significant element of a higher-level intelligent system is the
ability to "learn". Intelligent systems that can learn effectively, and thereby acquire higher
levels of intellectual capacity, are more likely to successfully address the challenges
inherent in stochastic real-world environments. Currently, Genetic Algorithms and
heuristic-based search algorithms, neither of which is learning-based, offer the most promising
approaches to address large-scale static optimization problems within reasonable computational
cost limits [Chellapilla and Fogel, 1997; Grotschel and Holland, 1991]. However, their
inability to adapt through learning seriously diminishes their utility for deployment as a
building block for static optimization, to which dynamic optimization problems ultimately lead.
Artificial neural networks (ANN), as static optimizers for large-scale problems and as
systems that can be trained, are real contenders against existing non-learning search
algorithms since ANNs possess the potential to contribute towards solving challenging
dynamic optimization problems. However, a survey of the literature indicates that, among the
significant studies reported, practically all artificial neural network algorithms applied to
static optimization problems were preprogrammed [Smith et al., 1998; Cichocki, 1993]: no
training procedure was applied to adapt the weights of the networks.
The Hopfield network (HN) and its derivatives are perhaps the most widely used ANN
algorithms that address static optimization problems; they topologically belong to the class of
single-layer, relaxation-type recurrent ANNs. The HN derivatives include networks relying on
gain scheduling as in simulated annealing, networks with nodes modeled as lossless
integrators, networks of nodes with unipolar activation functions, networks with additive
uncorrelated noise of zero mean and a variance gradually decreasing in time, the mean-field
theory network, and the mean-field annealing network, among others.
Studies of particularly large-scale static optimization problems using artificial neural
networks, especially the HN and its derivatives, are scarce in the literature [Smith et al.,
1998; Matsuda et al., 1998; Gall et al., 1999]. In the case of the TSP, most reported studies
consider the 100-city problem as the norm to demonstrate the validity of the research
findings. This creates a reasonable degree of uncertainty as to whether the proposed
algorithms, which deliver good performance for 100 or 200 cities, will be able to scale well
for much larger problem sizes. Specifically, keeping the quality of solutions consistent as
the problem size increases is of primary interest and concern when assessing the scalability
property of a given neural optimizer algorithm.
A typical search session for the HN and its derivatives is likely to include numerous
relaxations. A relaxation starts with initialization of the parameters associated with network
dynamics, continues with updates of dynamic system states/outputs, and concludes upon
convergence of network states/outputs to a stable equilibrium point, assuming at least one
exists. After each unsuccessful relaxation, the network is simply reinitialized for the next
relaxation, ignoring the experience associated with the already completed relaxation. The
HN and its derivatives lack a mechanism to incorporate the experience gained following the
previous relaxation cycles. In other words, they do not employ any learning that allows the
neural search algorithm to benefit from the results of prior relaxations, since weights are
preprogrammed and not modified afterwards.
A learning-based recurrent neural search algorithm is expected to offer significant
performance improvements over a non-learning based algorithm. One such neural paradigm,
the Simultaneous Recurrent Neural network (SRN) [Werbos, 1992 and 1994; Pang and
Werbos, 1997] incorporates powerful features: it is a recurrent algorithm with relaxation
search capability, while also being trainable. The Simultaneous Recurrent Neural network
has the potential to develop, through "learning", the ability to address the computationally
challenging task of large-scale static optimization. This forms a very important first step
towards eventually addressing dynamic optimization problems through algorithms that can
formulate learning-based solutions.
This study proposes using a learning-based artificial neural network algorithm, the
Simultaneous Recurrent Neural network, to address large-scale static optimization problems.
Existing neural optimizer algorithms are either trainable feedforward architectures or
recurrent architectures with preprogrammed weight structures. The novel contribution of this
study is the use of a neural network that is both trainable and recurrent to address
large-scale static optimization problems.
Simultaneous Recurrent Network
A Simultaneous Recurrent Network (SRN) is an artificial neural network [Werbos, 1994;
Pang and Werbos, 1997] with the graphical representation as in Figure 1.
Figure 1. Simultaneous Recurrent Network Graphical Representation.
The system has external inputs in the form of a vector x, a feed-forward vector function F
(any feed-forward network, including the multi-layer perceptron, is appropriate), outputs in
the form of a vector z, and a feedback path which copies the outputs to the inputs without a
time delay.
The feed-forward network F will also induce a weight matrix W, which represents the
interconnection topology of the network. The network, starting from an initial state as
indicated by the initial value of the output vector, will iterate until the output vector z
stabilizes and converges to a stable point, given that one exists. In other words, an SRN is
based on a feed-forward network with simultaneous feedback from outputs of the network to
its inputs. An SRN exhibits complex temporal behavior: it follows a trajectory in the state
space to relax to a fixed point. One "relaxation" of the network consists of one or more
iterations of output computation and propagation along the feed-forward and feedback paths
until the outputs converge to a stable equilibrium value.
A more formal description of the SRN is formulated in [Werbos, 1992], who defines an SRN as a
mapping

\hat{z} = F(x, W),  (1a)

where x and W are the external inputs and weights, respectively, and \hat{z} is the
equilibrium value of z, i.e.,

\hat{z} = \lim_{n \to \infty} z(n),  (1b)

which can be computed by the following iteration

z(n+1) = F(z(n), x, W),  (1c)

where F is a feedforward network and n is the iteration index, with very fast computation
cycles compared to the feedback delays found in time-lagged recurrent networks.
The network is provided with the external inputs and initial outputs, which are typically
chosen randomly in the absence of a priori information. The output of the previous iteration
is fed back to the network along with the external inputs to compute the output of the next
iteration. The network is allowed to iterate until it reaches a stable equilibrium point,
assuming at least one exists. External inputs are applied throughout the complete relaxation
cycle. When a stable equilibrium point is reached, the outputs stop changing (i.e., the value
of z(n+1) is equal or very close to z(n)). It is important to note that the feedback from the
output layer to the input layer in the SRN is not delayed: the feedback is, theoretically
speaking, simultaneous.
SRN As A Static Optimizer for the TSP
The Traveling Salesman Problem (TSP) was chosen as the benchmark for the performance
evaluation of the SRN because it is representative of NP-hard optimization problems. In the
TSP, a salesman visits N cities (or nodes) cyclically. In one tour, he visits each city
exactly once and concludes where he started. The goal is to find the order in which he should
visit the cities so as to minimize the total distance traveled. Selection of the TSP as the benchmark
is appropriate because almost all non-learning neural search algorithms fail to deliver
acceptable quality solutions with reasonable computational cost and time for large-scale
variants of this problem.
SRN Topology for the TSP
An N-city TSP is represented by an N×N array, where each row represents a different city
and each column represents the possible positions of the cities in the path. In order to
represent an N-city TSP using the SRN, the feedforward network F in the SRN consists of
two layers. The output layer is an N×N matrix of nodes, with each row representing a city
and each column representing a possible position in the path. Additionally, there is a single
layer of hidden nodes. In the TSP, the inputs to the problem are the distances between the
cities, represented by the cost matrix. As presented in the next section, the cost matrix is
used in the calculation of the error function of the training algorithm. Since the cost matrix is
used as an input to the error function, it does not need to also be included as an external input
to the SRN. Therefore, in the case of the TSP the external inputs x in Figure 1 are not
applied to the network.
The SRN for the TSP consists of a two-layer network with a relatively small number of
hidden nodes and an N×N array of output nodes, with a recurrent connection between the
nodes in the output layer and the hidden layer. Since no external inputs exist, the network is
simply initialized with small random values for the weights and the outputs z and allowed to
relax. Once the network converges to a fixed point, the solution computed by the network is
considered to have materialized. The trajectories the neural network dynamics follow in the
phase space are of no relevance; the fixed points/stable equilibrium points to which the
trajectories lead are the computationally useful entities for the TSP. After the outputs of
the network converge to a fixed point, the outputs can be compared against a problem-specific
error function and the weights modified using a suitable learning algorithm.
Figure 2 shows the architecture of the SRN for the TSP, where the feedforward network F
consists of one hidden layer and one output layer. The SRN has trainable connections from
each neuron of the hidden layer to each neuron of the output layer, which is represented with
the forward weight matrix P. The simultaneous recurrent connections, which are also
trainable, exist from each node in the output layer to each node in the hidden layer, and are
represented by the backward weight matrix V. There are no connections among the nodes in
either the hidden layer or the output layer. All of the connections in the network are
trainable. The outputs z of the network are taken from each output layer neuron. As
discussed above, the external inputs x are not necessary.
Figure 2. SRN Architecture for the TSP.
Neuron dynamics for the output layer of the SRN are represented by

ds_i/dt = -s_i + \sum_{j=1}^{J} p_{ij} y_j  and  z_i = f(s_i)  for i = 1, 2, ..., N×N,  (2a)

where y_j is the output of the j-th neuron in the hidden layer, J is the node count of the
hidden layer, p_{ij} is the forward weight from the j-th neuron in the hidden layer to the
i-th neuron in the output layer, N×N is the dimension of the output array, and f is a
continuous, differentiable function, typically a sigmoid with a steep slope (for combinatorial
optimization problems). Similarly, for a neuron y_j in the hidden layer, the dynamics are
defined by

ds_j/dt = -s_j + \sum_{i=1}^{N×N} v_{ji} z_i  and  y_j = f(s_j)  for j = 1, 2, ..., J,  (2b)

where z_i is the output of the i-th neuron in the output layer, v_{ji} is the backward weight
from the i-th neuron in the output layer to the j-th neuron in the hidden layer, and f is a
continuous and differentiable function, typically a sigmoid with a steep slope.
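The continuous dynamics in (2a) and (2b) can be simulated with a simple forward-Euler
discretization. The following sketch uses our own variable names and step size; it is an
illustration of the update structure, not the authors' code.

```python
import numpy as np

def sigmoid(s, slope=100.0):
    # Steep unipolar sigmoid in [0, 1]; argument clipped to avoid overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(slope * s, -50.0, 50.0)))

def srn_step(s_out, s_hid, P, V, dt=0.01):
    """One forward-Euler step of the dynamics in Eqs. (2a)-(2b).

    s_out: (N*N,) output-layer net inputs; s_hid: (J,) hidden-layer net
    inputs; P: (N*N, J) forward weights p_ij; V: (J, N*N) backward
    weights v_ji. All names are illustrative assumptions.
    """
    y = sigmoid(s_hid)                     # y_j = f(s_j)
    z = sigmoid(s_out)                     # z_i = f(s_i)
    s_out = s_out + dt * (-s_out + P @ y)  # ds_i/dt = -s_i + sum_j p_ij y_j
    s_hid = s_hid + dt * (-s_hid + V @ z)  # ds_j/dt = -s_j + sum_i v_ji z_i
    return s_out, s_hid

# Toy relaxation for a 3-city problem (9 output nodes) with 5 hidden nodes.
rng = np.random.default_rng(0)
P = rng.uniform(-0.2, 0.2, (9, 5))
V = rng.uniform(-0.2, 0.2, (5, 9))
s_out, s_hid = np.zeros(9), np.zeros(5)
for _ in range(1000):
    s_out, s_hid = srn_step(s_out, s_hid, P, V)
```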
Training the SRN with the RBP
In order to train the SRN, it is necessary to define a measure of the error for the output
neurons. The error function needs to ensure valid solutions, as well as a minimum path
length. Certain constraints need to be in place to ensure a valid solution. Given the problem
representation presented above, each row and column in the N×N output array must have
exactly one neuron active: the output value of an active neuron should be close to 1.0 while
the output of each inactive neuron in the N×N array must approach the limiting value of 0.0.
The traveling salesman must visit each city exactly once. Thus, when the network converges to
a solution, there should be exactly one active node per row and per column. This constraint
can be implemented using inhibition among the nodes in a given row and column. The error term
for the column constraint is defined by
E_{col} = g^{col} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{m=1}^{N} z_{mj}(\infty) \right]^2,  (3a)

where i and j are the indices for rows and columns, respectively, m is the index for rows of
the network, z_{mj}(\infty) is the stable value of the mj-th neuron output upon convergence to
a fixed point, and g^{col} is a positive real weight parameter. When each column of the output
matrix has exactly one active neuron, this error term will be zero. The first summation over
the indexing variable i is included because the error function needs to be defined for each
neuron in the output layer.
Similarly, the error term for the row constraint is given by
E_{row} = g^{row} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ 1 - \sum_{n=1}^{N} z_{in}(\infty) \right]^2,  (3b)

where i and j are the indices for rows and columns of the network, respectively, n is the
index for columns, and g^{row} is a positive real weight parameter. This error term will have
a value of zero when each row of the output matrix has exactly one active neuron. Again, the
second summation over the index variable j is included since the error function needs to be
defined for every ij-th neuron in the output layer.
An error term is also introduced that forces the neuron outputs toward the limiting values of
0.0 or 1.0:

E_{bin} = g^{bin} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ -\left( z_{ij}(\infty) - \alpha \right)^2 + \beta \right],  (3c)

where α and β are constants and g^{bin} is the positive real weight parameter for this
constraint. Each term is a downward-opening quadratic function. By choosing values of α = 0.5
and β = 0.25, the zeros of the function are at 0.0 and 1.0. Thus, this error term has a
minimum value of zero when all of the neurons in the output layer have their output values at
either 0.0 or 1.0.
The error term associated with the distance between the cities can be formulated as

E_{dis} = g^{dis} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{m=1}^{N} z_{ij}(\infty)\, z_{m(j+1)}(\infty)\, d_{im},  (3d)
where d_{im} is the cost associated with the path from city i to city m and g^{dis} is the
positive real weight parameter for this constraint. For each neuron z_{ij}, the index m runs
over each neuron in the (j+1)-st column, indicated by the z_{m(j+1)} term. If both neurons are
active, the distance from city i to city m, d_{im}, is included in this error term; the
minimum value is achieved when the total distance of the path is minimum.
The total error function E is the sum of the individual error terms:

E = E_{col} + E_{row} + E_{bin} + E_{dis}.  (3e)
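The four error terms of (3a) through (3e) can be written compactly with array operations. The
sketch below is our reading of the definitions: the factor of N on the constraint terms (from
the outer double sums) and the cyclic wrap of column j+1 are assumptions, as are all variable
names.

```python
import numpy as np

def tsp_error(Z, D, g_col=0.003, g_row=0.003, g_bin=0.003, g_dis=0.01,
              alpha=0.5, beta=0.25):
    """Total error E = E_col + E_row + E_bin + E_dis, Eqs. (3a)-(3e).

    Z is the N x N matrix of stable outputs z_ij(inf); D is the cost
    matrix d_im.
    """
    N = Z.shape[0]
    # (3a)/(3b): the outer double sum repeats each bracket N times,
    # hence the factor of N.
    E_col = g_col * N * np.sum((1.0 - Z.sum(axis=0)) ** 2)
    E_row = g_row * N * np.sum((1.0 - Z.sum(axis=1)) ** 2)
    # (3c): zero exactly when every output sits at 0.0 or 1.0.
    E_bin = g_bin * np.sum(-(Z - alpha) ** 2 + beta)
    # (3d): cost of consecutive columns; roll makes column j+1 cyclic.
    Z_next = np.roll(Z, -1, axis=1)
    E_dis = g_dis * np.einsum('ij,mj,im->', Z, Z_next, D)
    return E_col + E_row + E_bin + E_dis

# A valid permutation matrix zeroes the constraint terms, leaving only
# the weighted tour length.
E = tsp_error(np.eye(3), np.ones((3, 3)))
```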
In order to converge on a valid and good (with minimum total distance) solution for the TSP,
the state space portrait of the SRN must be changed by moving the fixed points towards
preferably good solutions of the TSP. This is accomplished by altering the weights of the
network using a training algorithm. To train the SRN, recurrent backpropagation (RBP), a
variant of the traditional backpropagation algorithm, is used. This is a gradient descent
learning method for recurrent neural networks, which reshapes the state space portrait of the
network based on a defined error measure.
The full derivation of the RBP algorithm can be found in [Pineda, 1987; Werbos, 1988]. The
RBP training algorithm requires an adjoint network, topologically identical to the SRN except
with all signal directions reversed, to be set up and relaxed in order to compute updates for
the weights of the SRN. The adjoint network accepts the error, which is computed using the
stable values of the neurons in the output layer of the SRN upon convergence to a fixed point,
as external input to the neurons in its input layer, which corresponds to the output layer of
the SRN.
The RBP training algorithm for the SRN is implemented as follows. Upon convergence of
the SRN dynamics to a fixed point, error values for output nodes need to be computed. The
error for an output node is computed by the following formula:
e_i = \tau_i - z_i(\infty),  (4)

where z_i(\infty) is the stable output value of the i-th neuron in the output layer upon
convergence to a fixed point, with i = 1, 2, ..., N×N, and \tau_i is the desired value of the
i-th neuron output.
Next, an adjoint network is set up. Its topology is identical to that of the SRN with all
signal directions reversed, i.e., the ij-th element of the forward/backward weight matrix of
the adjoint network is equal to the ji-th element of the forward/backward weight matrix of the
SRN, respectively, and the output/input layers are relabeled as input/output, i.e.,
y^* \leftarrow z and z^* \leftarrow y, where starred vectors/matrices are associated with the
adjoint network. The adjoint network has the following linear dynamics:

dz^*_j/dt = -z^*_j + \sum_{i=1}^{N×N} p^*_{ji} y^*_i  for j = 1, 2, ..., J  and

dy^*_i/dt = -y^*_i + \sum_{j=1}^{J} v^*_{ij} z^*_j + e_i  for i = 1, 2, ..., N×N,  (5)

for the neurons in the output layer (a 1×J array) and input layer (an N×N array) of the
adjoint network, respectively, while noting that p^*_{ji} = p_{ij} and v^*_{ij} = v_{ji}.
Following the derivation in the Appendix, the error term for each neuron in the output layer
is computed by

e_{qr} = 2 g^{col} \left[ \sum_{m=1}^{N} z_{mr}(\infty) - 1 \right] + 2 g^{row} \left[ \sum_{n=1}^{N} z_{qn}(\infty) - 1 \right] - 2 g^{bin} \left( z_{qr}(\infty) - \alpha \right) + g^{dis} \sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty),  (6)

where q, r = 1, 2, ..., N and i, the index for the output layer neurons, is related to the row
index q and column index r by i = (q - 1)N + r.
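Equation (6) can be evaluated for all output neurons at once. The sketch below is a vectorized
reading of its four terms; the cyclic treatment of column r+1 and all names are our
assumptions.

```python
import numpy as np

def output_errors(Z, D, g_col, g_row, g_bin, g_dis, alpha=0.5):
    """Vectorized Eq. (6): error injected at each output neuron (q, r).

    Z: N x N stable outputs z_qr(inf); D: cost matrix d_qm.
    """
    col = 2.0 * g_col * (Z.sum(axis=0) - 1.0)   # column term, one value per r
    row = 2.0 * g_row * (Z.sum(axis=1) - 1.0)   # row term, one value per q
    e = col[None, :] + row[:, None] - 2.0 * g_bin * (Z - alpha)
    e += g_dis * (D @ np.roll(Z, -1, axis=1))   # sum_m d_qm z_m(r+1), cyclic
    return e
```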
Noting that local stability of the SRN dynamics is a sufficient condition for the convergence
of the adjoint network dynamics [Pineda, 1987; Almeida, 1987], once the adjoint network
converges, the weight updates can be computed as

\Delta p_{ij} = -\eta \, \partial E / \partial p_{ij} = \eta \, f'(s_i(\infty)) \, y^*_i(\infty) \, y_j(\infty)  and

\Delta v_{ji} = -\eta \, \partial E / \partial v_{ji} = \eta \, f'(s_j(\infty)) \, z^*_j(\infty) \, z_i(\infty)

for the forward and backward weight matrix entries, respectively, where η is the learning rate
and f' is the derivative of the function f.
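One relaxation of the adjoint network followed by the weight updates can be sketched as below,
again with a forward-Euler discretization and our own naming conventions; it is an
illustration of the update structure, not the authors' implementation.

```python
import numpy as np

def f(s, slope=100.0):
    # Steep unipolar sigmoid; argument clipped to avoid overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(slope * s, -50.0, 50.0)))

def fprime(s, slope=100.0):
    v = f(s, slope)
    return slope * v * (1.0 - v)

def rbp_update(P, V, s_out, s_hid, e, eta=0.01, dt=0.01, n_steps=2000):
    """Relax the adjoint network of Eq. (5), then apply weight updates.

    P: (M, J) forward weights (M = N*N); V: (J, M) backward weights;
    s_out, s_hid: converged SRN net inputs; e: (M,) output errors.
    """
    M, J = P.shape
    y = f(s_hid)                 # hidden outputs y_j(inf)
    z = f(s_out)                 # output-layer outputs z_i(inf)
    z_adj = np.zeros(J)          # adjoint output layer z*
    y_adj = np.zeros(M)          # adjoint input layer y*
    for _ in range(n_steps):
        # Eq. (5) with p*_ji = p_ij and v*_ij = v_ji:
        z_adj = z_adj + dt * (-z_adj + P.T @ y_adj)
        y_adj = y_adj + dt * (-y_adj + V.T @ z_adj + e)
    # Weight updates at the adjoint fixed point:
    dP = eta * np.outer(fprime(s_out) * y_adj, y)  # Δp_ij = η f'(s_i) y*_i y_j
    dV = eta * np.outer(fprime(s_hid) * z_adj, z)  # Δv_ji = η f'(s_j) z*_j z_i
    return P + dP, V + dV
```

With zero injected error the adjoint network stays at rest and the weights are unchanged,
which is a quick sanity check of the update rule.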
Parameter Definitions for SRN and RBP
Many different variables exist in the SRN and the RBP training algorithm that can affect the
ability of the network to converge, the speed of finding a solution, the ability to find a
solution, and the quality of the solution to the TSP, among others. Within the architecture of
the SRN itself, several choices have to be made. The structure of the network F can be any
feedforward network. In this case, a multi-layer perceptron (MLP) with one hidden layer and
output layer, which corresponds to a minimal topology, was chosen. For both layers of the
SRN, neuron activation functions were modeled with unipolar continuous sigmoid in the
range of [0.0,1.0] with a relatively large steepness value of 100.
Another empirically determined parameter is the number of nodes in the hidden layer, which
can have a drastic effect on the speed of convergence of the network. A twofold increase in
the number of hidden nodes doubles the number of weights from the hidden layer to the output
layer and from the output layer to the hidden layer, which also doubles the memory
requirements. This in turn increases the number of calculations required, leading to longer
relaxation times and a larger number of relaxations to locate a solution. The choice of five
hidden layer nodes provided reasonable relaxation times and counts to compute a solution while
keeping the memory requirements manageable.
The weights and outputs of the SRN also need to be initialized. Small random numbers,
uniformly distributed in the interval [–0.2,0.2], were used to initialize the two weight
matrices, P and V. The outputs of the SRN were initialized to uniformly distributed random
values in the interval [0.0,1.0]. Once the training began, the outputs were not re-initialized
subsequent to each relaxation during the training: simply the previous outputs were used.
The error function includes several weight parameters that affect the scaling of each
individual error term. It was empirically determined that the precise values of these
parameters had little effect on the network, but the ratio between the parameters did. For
example, the normalized row and column error terms were roughly 10 to 50 times larger than
the distance error term at the start of training. In order to keep the row and column terms
from monopolizing the total error, the distance weight parameter was initialized to be 10
times larger than other weight parameters.
Additionally, the weight parameters need to be incremented during the training process. If
the parameters are not incremented, the training algorithm fails to find a solution: after a
number of iterations, training ceases to provide improved results. By incrementing the weight
parameters every so many iterations, the algorithm does eventually converge to a solution.
Incrementing the weight parameters introduces two additional variables to the problem: the
amount of the increment and the frequency of incrementing. Incrementing the parameters by
small values, on the order of 0.004, provided good results, and incrementing the parameters
every five relaxations provided the quickest convergence towards a solution. Incrementing the
parameters more frequently sped up the algorithm, but five relaxations was found to be the
lower limit on this variable before the algorithm again failed to converge.
The values of the parameters can have a drastic effect on the quality of the solution and the
time it takes to converge. If too much emphasis is placed on the row and column error terms,
a solution will be found quickly, but the path length will only be average. If too much
emphasis is placed on the distance term, the path distance may become very small, but a valid
path that satisfies the row and column constraints will not be found even after very many
iterations. The values in Table 1 provided good results for problems of all sizes tested.
         Initial Value   Increment
g^col    0.003           0.004
g^row    0.003           0.004
g^bin    0.003           0.004
g^dis    0.010           0.020

Table 1. Error Function Parameter Values and Increments.
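The increment schedule described above, with the values of Table 1, can be captured in a few
lines; the helper name and dictionary layout are our own.

```python
# Constraint-weight schedule from Table 1: start at the initial values
# and add the increment once every five relaxations.
initial = {"g_col": 0.003, "g_row": 0.003, "g_bin": 0.003, "g_dis": 0.010}
increment = {"g_col": 0.004, "g_row": 0.004, "g_bin": 0.004, "g_dis": 0.020}

def weights_at(relaxation, every=5):
    """Constraint weights in effect after `relaxation` completed relaxations."""
    bumps = relaxation // every
    return {k: initial[k] + bumps * increment[k] for k in initial}
```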
When to stop training the network is another important question. Since the best solution is
not known, it is not possible to stop training when the best solution is reached. One
measure, and the only criterion practically available, is to stop training when a valid
solution is reached, even though it may not be the best or even a good solution. It is
important to keep in mind that further training past this point could achieve a better
solution, depending in particular on the weight parameter values.
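The validity criterion, exactly one active neuron per row and per column, is straightforward
to check once the network has relaxed. The sketch below also decodes the tour and computes its
length; the 0.5 activation threshold and all names are our assumptions.

```python
import numpy as np

def decode_tour(Z, threshold=0.5):
    """Validity test and tour extraction for a converged N x N output.

    A solution is valid when each row and each column contains exactly
    one active (near-1.0) neuron. Returns the city visited at each
    position, or None if the matrix does not encode a valid tour.
    """
    A = Z > threshold
    if not ((A.sum(axis=0) == 1).all() and (A.sum(axis=1) == 1).all()):
        return None
    return [int(np.argmax(A[:, j])) for j in range(A.shape[1])]

def tour_length(tour, D):
    # Total cyclic path length under cost matrix D.
    n = len(tour)
    return sum(D[tour[j], tour[(j + 1) % n]] for j in range(n))
```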
Simulation Study Results
Simulations were performed for problem sizes of 100 to 600 cities in increments of 100
cities. The distances between the cities, i.e., the cost matrix, were initialized with
uniformly random numbers in the interval [0.0,1.0]. It is noted that a random choice of a path
should result in an average distance of 0.5 between any two neighboring cities. The diagonal
elements of the cost matrix were set to 1.0, the maximum distance, to prevent looping from a
city back to itself. The simulations were performed on a Sun Ultra 10 Workstation with dual
300 MHz CPUs and 1.2 GB RAM, running SunOS 5.7.
Table 2 below shows the simulation results. To determine the computational time, the UNIX
command timex was used. From this, the total user time and the total system time were
added together. The sum of these values gives the total CPU time used by the process, which
is invariant to other processes running in the background.
Number      Average Distance   Average Number    Average Computation
of Cities   Between Cities     of Relaxations    Time (minutes)
100         0.24               650               10
200         0.30               1700              135
300         0.26               3000              634
400         0.27               3200              1070
500         0.26               3500              1550
600         0.29               4010              2409

Table 2. SRN Simulation Results for the TSP.

There are two important observations that can be made based on the data in Table 2: the
quality of solutions found for any problem size is much better than the expected value of a
randomly chosen solution, and the quality of solutions does not deteriorate as the problem
size is increased from 100 to 600 cities. The significant implication is that the SRN is a
"good" neural optimizer for its ability to compute high-quality solutions, and that the SRN
scales up with increases in the problem size in terms of its ability to deliver consistently
high-quality solutions.
Comparative Performance Assessment
The TSP has been extensively studied in the combinatorial optimization field as a benchmark
problem. Non-neural search algorithms employing heuristics report success on large-scale
problems in terms of both the quality of solutions and computational efficiency [Shutler,
2001; Grotschel and Holland, 1991]. Genetic Algorithms (GA) are among the most computationally
efficient and effective (in their promise to locate nearly global optima) solution algorithms
for the TSP [Chellapilla and Fogel, 1997]. Concurrent efforts to solve the TSP using neural
algorithms employed Hopfield recurrent networks and their stochastic derivatives, i.e.,
mean-field annealing and the Boltzmann Machine, as well as derivatives of self-organizing
neural networks [Gee and Prager, 1995]. One important common feature, or shortcoming, of all
these algorithms is that they are non-learning: they do not have the ability to improve their
performance based on experience. Furthermore, heuristics-based search algorithms are not
generalized search algorithms, since heuristics are often problem-specific for maximum
utility. Similar observations apply to self-organizing neural networks, since these
algorithms are highly specialized for a given problem without much potential for general
applicability to a class of problems.
The Hopfield network and its derivatives are frequently used to address optimization
problems. The Hopfield network is a single layer, fully connected relaxation-type network.
In previous studies [Serpen et al., 2000; Serpen et al., 1997; Gee and Prager, 1995; Smith et
al., 1998], its application to even small-scale TSPs produced mostly average-quality
solutions. As the problem size increased to even 100 cities, the solution quality tended
toward the average more markedly, indicating the inability of the Hopfield network to scale
to even moderately large problems.
The Boltzmann Machine is a version of the Hopfield network with a stochastic search
component. The Boltzmann Machine suffers from excessive memory requirements for large-scale
problems. For an N-city problem, the network requires N^2 neurons and N^4 weights, since each
neuron is connected to every other neuron. For a 1,000-city problem, this is 1,000,000
neurons and 10^12 weights. This number of weights requires prohibitively large memory
storage, in addition to tremendous computational power, to allow the network to relax
following a sequential annealing schedule and to calculate the weight updates in a reasonable
amount of time, unless hardware realization of the algorithm becomes feasible. Given the
memory and computational time requirements of the Boltzmann Machine, simulation of the
algorithm is not practical for empirically assessing its true computational promise for
large-scale problems, as the scarcity of such simulations reported in the published
literature also indicates [Gee and Prager, 1995; Smith et al., 1998].
Evolutionary computing is most likely the foremost field for computing the global optimum of
a function [Werbos, 1999]. In order to facilitate a performance comparison between the SRN
and the Genetic Algorithm, the GA was applied to the same TSP set [Geib, 2000]. The software
used was the GAlib Genetic Algorithm package, version 2.4.5 [Wall, 2001], which was
instantiated to implement a steady-state genetic algorithm with 1% of the population replaced
each generation. An ordered list of cities was used as the genome, and the genetic operator
was an edge recombination crossover operator. Partial match crossover was also tried, but
performed poorly. The population size was specified as 100. The population was allowed to
evolve until the best solutions from two consecutive populations were within a specified
error tolerance of 0.01. Overall, the GA was able to compute better-quality solutions than
the SRN/RBP algorithm at much lower computational cost for up to 600 cities for the TSP
[Geib, 2000]: the GA was able to locate solutions with normalized total distance in the range
[0.07,0.15] for the same TSP instances. Although the GA performed well for the relatively
large-scale TSPs, a recent conjecture [Werbos, 1999] suggests that the GA is not likely to
handle problems with a very large number of variables due to its lack of ability to adapt
through learning.
Conclusions
The Simultaneous Recurrent Neural network with Recurrent Backpropagation training
algorithm was able to find “good quality” solutions for large-scale Traveling Salesman
Problem in the range of 100 to 600 cities. These “good quality” solutions for large-scale
variants of the problem were obtained through increased computational effort. It is
significant to note that SRN was able to locate a good quality solution after every attempt.
The computational cost required to employ the SRN as static optimizer algorithm appears to
be relatively high. The initial and incremental values of the constraint weight parameters
need to be determined heuristically and play very important role for properly guiding the
training of the network. However, it was not very difficult to find initial values and the
Page 21 of 27
values for increments of these constraint weight parameters. The average normalized distance
between cities along the travel path computed by the SRN was typically in the range of 0.25 to
0.35, compared with an expected value of 0.50 for a randomly chosen pair of cities, and
remained in the same interval as the problem size was varied from 100 to 600 cities. The
Simultaneous Recurrent Neural network scaled well with the increase in problem size at the
expense of increased computational cost, which can potentially be overcome if and when a
hardware realization of the algorithm becomes feasible.
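The 0.50 benchmark quoted above can be checked numerically. The sketch below is not part of the original study; it assumes cities drawn uniformly at random from the unit square, for which the mean Euclidean distance between two random cities is known to be about 0.52, close to the paper's stated expected value.

```python
import random

def mean_random_leg_length(samples=100_000, seed=1):
    """Monte Carlo estimate of the expected distance between two
    cities placed uniformly at random in the unit square."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x1, y1, x2, y2 = (rng.random() for _ in range(4))
        total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total / samples
```

Against this baseline of roughly 0.52 per normalized leg, the SRN's 0.25 to 0.35 range corresponds to tours substantially shorter than a random permutation at every problem size tested.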
This simulation-based study further indicated that the SRN trained with RBP as a static
optimizer is a robust algorithm with respect to stability. For randomly specified initial
weight matrix values and node outputs, the neural network dynamics converged to a stable
point on every attempt, across a large variation in constraint weight parameter values and
problem sizes. The algorithm thus demonstrated stability, in the sense of convergence to a
fixed point following relaxation, throughout the scope of the experiments performed. This
empirically observed feature might be a precursor to the existence of a Liapunov function
for the SRN/RBP in a large subspace of the high-dimensional parameter/weight space. It is
also reasonable to expect that incorporating a stochastic element into the search process
implemented by the SRN would improve the quality of solutions, possibly at the expense of
increased computational cost.
Further research efforts will concentrate on developing mathematical insight into the
dynamic system properties of the SRN, including initialization of parameters and weights,
the conditions for the existence of stable equilibrium points, and limitations and bounds on weight
update formulae. A more computationally efficient implementation of the training algorithm
and testing of the proposed algorithm on larger problem sizes (over 1000 cities) will also be
pursued.
REFERENCES
L. B. Almeida, “A Learning Rule for Asynchronous Perceptrons with Feedback in a
Combinatorial Environment,” Proceedings of IEEE ICNN, Vol. 2, pp. 609-618, 1987.
K. Chellapilla and D. B. Fogel, "Exploring Self-Adaptive Methods to Improve the Efficiency
of Generating Approximate Solutions to Traveling Salesman Problems Using
Evolutionary Programming," Evolutionary Programming, Vol. VI, pp. 361-371,
Springer-Verlag, Berlin, 1997.
A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing,
Wiley, New York, 1993.
P. Baldi, "Gradient Descent Learning Algorithm Overview: A General Dynamical Systems
Perspective," IEEE Transactions on Neural Networks, Vol. 6, No. 1, pp. 182-195, 1995.
A. Le Gall and V. Zissimopoulos, "Extended Hopfield Models for Combinatorial
Optimization", IEEE Transactions on Neural Networks, Vol. 10, No. 1, pp. 72-80,
January 1999.
A. H. Gee and R. W. Prager, "Limitations of Neural Networks for Solving Traveling
Salesman Problems," IEEE Transactions on Neural Networks, Vol. 6, No. 1, January
1995.
J. Geib, "The Simultaneous Recurrent Network Applied to Large-Scale Traveling Salesman
Problems," A Technical Report to Electrical Engineering and Computer Science
Department, The University of Toledo, August 2000.
M. Grötschel and O. Holland, "Solution of Large-Scale Symmetric Traveling Salesman
Problems," Mathematical Programming, Vol. 51:2, pp. 141-202, September 1991.
M. Held and R. M. Karp, “The Traveling-Salesman Problem and Minimum Spanning Trees:
Part II,” Mathematical Programming, Vol. 1, pp. 6-25, 1971.
M. Held and R. M. Karp, “The Traveling-Salesman Problem and Minimum Spanning Trees,”
Operations Research, Vol. 18, pp. 1138-1162, 1970.
J. J. Hopfield and D. W. Tank, "Computing with Neural Circuits: A Model," Science, Vol.
233, pp. 625-633, 1986.
J. J. Hopfield and D. W. Tank, "Neural Computation of Decisions in Optimization Problems,"
Biological Cybernetics, Vol. 52, pp. 141-152, 1985.
D. S. Johnson, L. A. McGeoch, and E. E. Rothberg, “Asymptotic Experimental Analysis for
the Held-Karp Traveling Salesman Bound,” Proceedings of the Seventh Annual ACM-
SIAM symposium on Discrete Algorithms, pp. 341-350, 1996.
S. Matsuda, "Optimal Hopfield Network for Combinatorial Optimization with Linear Cost
Function", IEEE Transactions on Neural Networks, Vol. 9, No. 6, pp. 1319-1330,
November 1998.
X. Pang and P. J. Werbos, “Neural Network Design for J Function Approximation in
Dynamic Programming,” Unpublished research article.
B. A. Pearlmutter, "Gradient Calculations for Dynamic Recurrent Neural Networks: A
Survey," IEEE Transactions on Neural Networks, Vol. 6, No. 5, pp. 1212-1228, 1995.
F. J. Pineda, "Generalization of Back-Propagation to Recurrent Neural Networks," Physical
Review Letters, Vol. 59, No. 19, pp. 2229-2232, 1987.
G. Serpen and D. L. Livingston, "Determination of Weights for Relaxation Recurrent Neural
Networks," Neurocomputing, Vol. 34, No. 1-4, pp. 145-168, September 2000.
G. Serpen and A. Parvin, “On the Performance of Hopfield Network for Graph Search
Problem,” Neurocomputing: An International Journal, Vol. 14, pp. 365-381, 1997.
G. Serpen, D. L. Livingston, and A. Parvin, “Determination of Parameters in Relaxation-
Search Neural Networks for Optimization Problems,” Proceedings of International
Conference on Neural Networks, Houston, TX, Vol. 2, pp. 1125-1129, 1997.
K. Smith, M. Palaniswami and M. Krishnamoorthy, "Neural Techniques for Combinatorial
Optimization with Applications", IEEE Transactions on Neural Networks, Vol. 9, No. 6,
pp. 1301-1318, November 1998.
P. M. E. Shutler, "An Improved Branching Rule for the Symmetric Traveling Salesman
Problem," Journal of the Operational Research Society, Vol. 52:2, pp. 169--175,
February 2001.
M. Wall, GAlib: A C++ Genetic Algorithm Library, http://lancet.mit.edu/ga/, 2001.
P. J. Werbos, "Brain-like Stochastic Search: A Research Challenge and Funding
Opportunity," unpublished article, 1999.
P. J. Werbos, “Optimization Methods for Brain-like Intelligent Control,” IEEE Conference
on Decision and Control, pp. 579-584, 1995.
P. J. Werbos, “The Brain as a Neurocontroller: New Hypothesis and New Experimental
Possibilities”, in K. Pribram, Ed., Origins: Brain and Self-organization, Hillsdale NJ:
Erlbaum, pp. 680-706, 1994.
P. J. Werbos, "Backpropagation Through Time: What It Does and How to Do It,"
Proceedings of the IEEE, Vol. 78, No. 10, pp. 1550-1560, 1990.
P. J. Werbos, "Generalization of Backpropagation with Application to A Recurrent Gas
Market Model." Neural Networks, Vol. 1, No. 4, pp. 234-242, 1988.
C. L. Valenzuela and A. J. Jones, “Estimating the Held-Karp lower bound for the geometric
TSP,” European Journal of Operational Research, Vol. 102, pp. 157-175, 1997.
APPENDIX
COMPUTATION OF ERROR FUNCTION FOR
RECURRENT BACKPROPAGATION
The error function E is defined in terms of the error value for each individual neuron, $e_i$, in
the output layer by

$$ E = \frac{1}{2}\sum_{i=1}^{N \times N} e_i^2 , $$

where the dimension of the output node array is N×N and the error is computed as in Equation
4. The derivative of the error function E with respect to some weight $w_{kl}$, where $w_{kl}$ is the weight
between the k-th node in the output layer for k = 1, 2, …, N×N and the l-th node in the hidden layer for
l = 1, 2, …, J, can conveniently be defined in terms of the error value for each individual output
neuron, $e_i$, by

$$ \frac{\partial E}{\partial w_{kl}} = -\sum_{i=1}^{N \times N} e_i \, \frac{\partial z_i(\infty)}{\partial w_{kl}} . $$

This equation can be rewritten in terms of an output node array with N rows and N columns,
where i, the index of the output layer neurons, is related to the row and column indices q and r
by i = (q−1)N + r:

$$ \frac{\partial E}{\partial w_{kl}} = -\sum_{q=1}^{N}\sum_{r=1}^{N} e_{qr} \, \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-1} $$
Note that, from Equation 3e, we have

$$ \frac{\partial E}{\partial w_{kl}} = \frac{\partial}{\partial w_{kl}}\left(E_{col} + E_{row} + E_{dis} + E_{bin}\right) = \frac{\partial E_{col}}{\partial w_{kl}} + \frac{\partial E_{row}}{\partial w_{kl}} + \frac{\partial E_{dis}}{\partial w_{kl}} + \frac{\partial E_{bin}}{\partial w_{kl}} . \tag{A-2} $$
Thus, the derivative of total error function can be computed by adding the derivatives of
individual error terms.
Using the error term due to the row constraint in Equation 3b and taking the derivative with
respect to $w_{kl}$ yields

$$ \frac{\partial E_{row}}{\partial w_{kl}} = -2 g_{row} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(1 - \sum_{n=1}^{N} z_{qn}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} , $$

while noting that $\partial z_{qn}(\infty)/\partial w_{kl} \neq 0$ only when n = r and k = (q−1)N + r.
The desirable form for this error term is then given by

$$ \frac{\partial E_{row}}{\partial w_{kl}} = 2 g_{row} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{n=1}^{N} z_{qn}(\infty) - 1\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-3} $$
Similarly, using the error term in Equation 3a for the column constraint and taking the
derivative with respect to $w_{kl}$ results in

$$ \frac{\partial E_{col}}{\partial w_{kl}} = -2 g_{col} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(1 - \sum_{m=1}^{N} z_{mr}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} , $$

since $\partial z_{mr}(\infty)/\partial w_{kl} \neq 0$ only when m = q and k = (q−1)N + r.
Further simplification of the above equation yields

$$ \frac{\partial E_{col}}{\partial w_{kl}} = 2 g_{col} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{m=1}^{N} z_{mr}(\infty) - 1\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-4} $$
The derivative of the error term due to the distance/cost constraint in Equation 3d with
respect to $w_{kl}$ is given by

$$ \frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N}\sum_{r=1}^{N}\sum_{m=1}^{N} d_{qm} \left( z_{qr}(\infty)\,\frac{\partial z_{m(r+1)}(\infty)}{\partial w_{kl}} + z_{m(r+1)}(\infty)\,\frac{\partial z_{qr}(\infty)}{\partial w_{kl}} \right) . $$

The partial derivative $\partial z_{qr}(\infty)/\partial w_{kl}$ vanishes except for the weight from the l-th neuron in the hidden layer
to the k-th neuron in the output layer, where k = (q−1)N + r. Therefore, the first partial derivative
term inside the parentheses is always zero, and the error term for the distance constraint
simplifies to

$$ \frac{\partial E_{dis}}{\partial w_{kl}} = g_{dis} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-5} $$
The derivative of the error term for the neuron output constraint in Equation 3c can be computed
as

$$ \frac{\partial E_{bin}}{\partial w_{kl}} = -2 g_{bin} \sum_{q=1}^{N}\sum_{r=1}^{N} \left(\alpha - z_{qr}(\infty)\right) \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-6} $$
By substituting the terms in Equations A-3 through A-6 into Equation A-2, the entire partial
derivative becomes

$$ \frac{\partial E}{\partial w_{kl}} = \sum_{q=1}^{N}\sum_{r=1}^{N} \left[ 2 g_{col}\left(\sum_{m=1}^{N} z_{mr}(\infty) - 1\right) + 2 g_{row}\left(\sum_{n=1}^{N} z_{qn}(\infty) - 1\right) - 2 g_{bin}\left(\alpha - z_{qr}(\infty)\right) + g_{dis}\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty) \right] \frac{\partial z_{qr}(\infty)}{\partial w_{kl}} . \tag{A-7} $$
Associating Equation A-7 with Equation A-1 yields

$$ e_{qr} = 2 g_{col}\left(1 - \sum_{m=1}^{N} z_{mr}(\infty)\right) + 2 g_{row}\left(1 - \sum_{n=1}^{N} z_{qn}(\infty)\right) + 2 g_{bin}\left(\alpha - z_{qr}(\infty)\right) - g_{dis}\sum_{m=1}^{N} d_{qm} z_{m(r+1)}(\infty) $$

for the error term for the neuron in the q-th row and r-th column of the two-dimensional output
array. This error term can readily be employed in the weight update formula given by
Equation 5 for the Recurrent Backpropagation algorithm to train the Simultaneous Recurrent
Neural Network.
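The per-neuron error derived above is straightforward to compute in vectorized form. The sketch below is an assumed NumPy transcription, not code from the original study; the symbols g_row, g_col, g_dis, g_bin, alpha, the relaxed outputs z(∞), and the distance matrix d follow the appendix notation, with z[q, r] read as city q at tour position r, and the tour-position index r+1 is assumed to wrap cyclically.

```python
import numpy as np

def output_error(z, d, g_row, g_col, g_dis, g_bin, alpha):
    """Error e_qr for each output neuron of the N x N array.

    z : (N, N) relaxed output array z(infinity)
    d : (N, N) inter-city distance matrix
    Returns the (N, N) array of e_qr values.
    """
    row_term = 2.0 * g_row * (z.sum(axis=1, keepdims=True) - 1.0)  # (N, 1): row sums vs. 1
    col_term = 2.0 * g_col * (z.sum(axis=0, keepdims=True) - 1.0)  # (1, N): column sums vs. 1
    bin_term = -2.0 * g_bin * (alpha - z)                          # neuron-output constraint
    # sum_m d[q, m] * z[m, r+1], with the position index wrapping cyclically
    dis_term = g_dis * (d @ np.roll(z, -1, axis=1))
    # e_qr is the negative of the bracketed sum, matching Equation A-1
    return -(row_term + col_term + bin_term + dis_term)
```

In a training loop this array would supply the error values consumed by the Recurrent Backpropagation weight update of Equation 5.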