soft computing - backpropagation terms
TRANSCRIPT
Adaptive Learning Rate
With standard steepest descent, the learning rate is held constant throughout training. The
performance of the algorithm is very sensitive to the proper setting of the learning rate. If the
learning rate is set too high, the algorithm can oscillate and become unstable. If the learning rate is
too small, the algorithm takes too long to converge. It is not practical to determine the optimal
setting for the learning rate before training, and, in fact, the optimal learning rate changes during
the training process, as the algorithm moves across the performance surface. You can improve the
performance of the steepest descent algorithm if you allow the learning rate to change during the
training process. An adaptive learning rate attempts to keep the learning step size as large as
possible while keeping learning stable. An adaptive learning rate requires some changes in the
training procedure. First, the initial network output and error are calculated. At each epoch new
weights and biases are calculated using the current learning rate. New outputs and errors are then
calculated.
If the new error exceeds the old error by more than a predefined ratio, the new weights and biases
are discarded. In addition, the learning rate is decreased. Otherwise, the new weights and biases are kept.
If the new error is less than the old error, the learning rate is increased. This procedure increases the
learning rate, but only to the extent that the network can learn without large error increases. Thus, a
near-optimal learning rate is obtained for the local terrain. When a larger learning rate could result
in stable learning, the learning rate is increased. When the learning rate is too high to guarantee a
decrease in error, it is decreased until stable learning resumes.
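As a rough sketch of this procedure, the Python code below adapts the learning rate once per epoch. The function names (grad_fn, error_fn) and the increase/decrease factors and error ratio are illustrative assumptions, not values given in the transcript.

    import numpy as np

    def train_adaptive_lr(weights, grad_fn, error_fn, lr=0.01, epochs=100,
                          lr_inc=1.05, lr_dec=0.7, max_err_ratio=1.04):
        # Steepest descent with an adaptive learning rate (illustrative sketch).
        error = error_fn(weights)                          # initial network error
        for _ in range(epochs):
            new_weights = weights - lr * grad_fn(weights)  # tentative step with current rate
            new_error = error_fn(new_weights)
            if new_error > error * max_err_ratio:
                lr *= lr_dec                               # error grew too much: discard step, decrease rate
            else:
                if new_error < error:
                    lr *= lr_inc                           # error fell: try a larger step next time
                weights, error = new_weights, new_error    # keep the new weights and biases
        return weights, lr

    # Toy usage: minimise f(w) = sum(w**2), whose gradient is 2w.
    w, lr = train_adaptive_lr(np.array([1.0, -2.0]),
                              grad_fn=lambda w: 2 * w,
                              error_fn=lambda w: float(np.sum(w ** 2)))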
Resilient Backpropagation
Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are
often called "squashing" functions, because they compress an infinite input range into a finite
output range. Sigmoid functions are characterized by the fact that their slopes must approach zero
as the input gets large. This causes a problem when you use steepest descent to train a multilayer
network with sigmoid functions, because the gradient can have a very small magnitude and,
therefore, cause small changes in the weights and biases, even though the weights and biases are far
from their optimal values. The purpose of the resilient backpropagation (Rprop) training algorithm is
to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the
derivative can determine the direction of the weight update; the magnitude of the derivative has no
effect on the weight update. The size of the weight change is determined by a separate update
value. The update value for each weight and bias is increased by a factor whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations. The update value is decreased by a factor whenever the derivative with respect to that
weight changes sign from the previous iteration. If the derivative is zero, the update value remains
the same. Whenever the weights are oscillating, the weight change is reduced. If the weight
continues to change in the same direction for several iterations, the magnitude of the weight change
increases.
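A minimal Python sketch of this update rule follows. The increase/decrease factors (1.2 and 0.5) and the step-size bounds are common defaults assumed here rather than values stated in the transcript, and the backtracking step of the full Rprop algorithm is omitted for brevity.

    import numpy as np

    def rprop_step(weights, grad, prev_grad, step,
                   inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
        # One Rprop update: only the sign of each derivative is used.
        sign_change = grad * prev_grad
        # Same sign on two successive iterations: grow that weight's update value.
        step = np.where(sign_change > 0, np.minimum(step * inc, step_max), step)
        # Sign flipped (oscillation): shrink that weight's update value.
        step = np.where(sign_change < 0, np.maximum(step * dec, step_min), step)
        # Move each weight by its own update value, against the gradient sign.
        return weights - np.sign(grad) * step, step

    # Toy usage: a few steps on f(w) = sum(w**2), whose gradient is 2w.
    w = np.array([1.0, -2.0])
    step = np.full_like(w, 0.1)
    prev_g = np.zeros_like(w)
    for _ in range(20):
        g = 2 * w
        w, step = rprop_step(w, g, prev_g, step)
        prev_g = g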
Conjugate Gradient Algorithms
The basic backpropagation algorithm adjusts the weights in the steepest descent direction (negative
of the gradient), the direction in which the performance function is decreasing most rapidly. It turns
out that, although the function decreases most rapidly along the negative of the gradient, this does
not necessarily produce the fastest convergence. In the conjugate gradient algorithms a search is
performed along conjugate directions, which produces generally faster convergence than steepest
descent directions. Several variations of conjugate gradient algorithms exist; they differ mainly in how the new search direction is computed at each step.
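To illustrate the idea on the simplest possible case, the sketch below applies conjugate gradient to a quadratic function, where the exact line search has a closed form. The matrix, vectors, and function names are assumptions made for this example, not part of the transcript.

    import numpy as np

    def conjugate_gradient(A, b, x, iters=10):
        # Minimise f(x) = 0.5 * x^T A x - b^T x for a symmetric positive definite A.
        r = b - A @ x          # residual = negative gradient = steepest descent direction
        d = r.copy()           # first search direction is the steepest descent direction
        for _ in range(iters):
            alpha = (r @ r) / (d @ A @ d)       # exact line search along d
            x = x + alpha * d
            r_new = r - alpha * (A @ d)
            if r_new @ r_new < 1e-12:           # converged
                return x
            beta = (r_new @ r_new) / (r @ r)    # Fletcher-Reeves-style coefficient
            d = r_new + beta * d                # new direction, conjugate to the previous ones
            r = r_new
        return x

    # Toy usage: for this 2x2 system the method converges in at most two iterations.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b, np.zeros(2)))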
Application
• Prediction: learning from past experience
• Classification: image processing
• Recognition: pattern recognition
• Data association: e.g. taking the noise out of a telephone signal (signal smoothing)
• Planning
• Data filtering
Uniform crossover
• A random mask is generated
• The mask determines which bits are copied from one parent and which from the other parent
• Bit density in the mask determines how much material is taken from the other parent (takeover parameter)
Mask:      0110011000 (randomly generated)
Parents:   1010001110   0011010010
Offspring: 0011001010   1010010110
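A small Python sketch of uniform crossover is given below. Here a mask bit of 1 means the bit is copied from the first parent into the first offspring, matching the example above, and the takeover parameter p (an assumption for this sketch) is the probability of a 1 in the mask.

    import random

    def uniform_crossover(parent1, parent2, p=0.5):
        # Uniform crossover on bit strings: a random mask decides each bit's source.
        mask = [random.random() < p for _ in parent1]   # takeover parameter p
        child1 = ''.join(a if m else b for m, a, b in zip(mask, parent1, parent2))
        child2 = ''.join(b if m else a for m, a, b in zip(mask, parent1, parent2))
        return child1, child2

    # Example with the parent strings from the transcript.
    c1, c2 = uniform_crossover("1010001110", "0011010010")
    print(c1, c2)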