Soft Computing - Backpropagation TERMS

Adaptive Learning Rate

With standard steepest descent, the learning rate is held constant throughout training. The performance of the algorithm is very sensitive to the proper setting of the learning rate. If the learning rate is set too high, the algorithm can oscillate and become unstable. If the learning rate is too small, the algorithm takes too long to converge. It is not practical to determine the optimal setting for the learning rate before training, and, in fact, the optimal learning rate changes during the training process as the algorithm moves across the performance surface. You can improve the performance of the steepest descent algorithm if you allow the learning rate to change during training. An adaptive learning rate attempts to keep the learning step size as large as possible while keeping learning stable. This requires some changes in the training procedure. First, the initial network output and error are calculated. At each epoch, new weights and biases are calculated using the current learning rate, and new outputs and errors are then calculated.

If the new error exceeds the old error by more than a predefined ratio, the new weights and biases are discarded and the learning rate is decreased. Otherwise, the new weights and biases are kept, and if the new error is less than the old error, the learning rate is increased. This procedure increases the learning rate, but only to the extent that the network can learn without large error increases. Thus, a near-optimal learning rate is obtained for the local terrain: when a larger learning rate could still give stable learning, the learning rate is increased, and when the learning rate is too high to guarantee a decrease in error, it is decreased until stable learning resumes.
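
The sketch below is one way to express this adjustment rule on a toy quadratic loss; the ratio threshold (err_ratio) and the increase/decrease factors (lr_inc, lr_dec) are illustrative choices, not values prescribed by the text.

```python
import numpy as np

def loss_and_grad(w):
    """Toy performance function: sum of squares, gradient = 2w."""
    return np.sum(w ** 2), 2 * w

w = np.array([2.0, -3.0])            # "weights and biases" collapsed into one vector
lr = 0.1                             # initial learning rate
err_ratio, lr_inc, lr_dec = 1.04, 1.05, 0.7   # illustrative settings

err, grad = loss_and_grad(w)         # initial output and error
for epoch in range(100):
    # Tentative steepest descent step with the current learning rate.
    w_new = w - lr * grad
    new_err, new_grad = loss_and_grad(w_new)

    if new_err > err * err_ratio:
        # Error grew by more than the allowed ratio:
        # discard the new weights and shrink the learning rate.
        lr *= lr_dec
    else:
        # Keep the step; if the error actually decreased, grow the learning rate.
        if new_err < err:
            lr *= lr_inc
        w, err, grad = w_new, new_err, new_grad

print(f"final loss {err:.2e}, final lr {lr:.3f}")
```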

Resilient Backpropagation

Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called "squashing" functions, because they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slopes must approach zero as the input gets large. This causes a problem when you use steepest descent to train a multilayer network with sigmoid functions, because the gradient can have a very small magnitude and therefore cause only small changes in the weights and biases, even when the weights and biases are far from their optimal values. The purpose of the resilient backpropagation (Rprop) training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative determines the direction of the weight update; the magnitude of the derivative has no effect on it. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations, and decreased by a factor whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, the update value remains the same. Thus, whenever the weights are oscillating, the weight change is reduced, and if the weight continues to change in the same direction for several iterations, the magnitude of the weight change increases.
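
A minimal sketch of this rule, in the simplified iRprop- style (the update is skipped on a sign flip), is given below on the same toy loss; eta_plus, eta_minus and the step-size bounds are common illustrative values, not values taken from the text.

```python
import numpy as np

def loss_and_grad(w):
    """Toy performance function: sum of squares, gradient = 2w."""
    return np.sum(w ** 2), 2 * w

w = np.array([2.0, -3.0])
step = np.full_like(w, 0.1)            # separate update value per weight
eta_plus, eta_minus = 1.2, 0.5         # increase / decrease factors
step_min, step_max = 1e-6, 50.0        # bounds on the update values

prev_grad = np.zeros_like(w)
for epoch in range(100):
    _, grad = loss_and_grad(w)
    same_sign = grad * prev_grad > 0
    flipped = grad * prev_grad < 0

    # Grow the update value where the derivative keeps its sign,
    # shrink it where the sign flips, leave it unchanged otherwise.
    step = np.where(same_sign, np.minimum(step * eta_plus, step_max), step)
    step = np.where(flipped, np.maximum(step * eta_minus, step_min), step)

    # Skip the update where the sign flipped (iRprop- style), then move
    # each weight by its update value in the direction of the sign alone.
    grad = np.where(flipped, 0.0, grad)
    w -= np.sign(grad) * step
    prev_grad = grad

print("final loss:", loss_and_grad(w)[0])
```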

Conjugate Gradient Algorithms

The basic backpropagation algorithm adjusts the weights in the steepest descent direction (the negative of the gradient), the direction in which the performance function decreases most rapidly. It turns out that, although the function decreases most rapidly along the negative of the gradient, this does not necessarily produce the fastest convergence. In the conjugate gradient algorithms, a search is performed along conjugate directions, which generally produces faster convergence than the steepest descent direction. This section presents four variations of conjugate gradient algorithms.
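
As a rough illustration of the idea, the sketch below runs a nonlinear conjugate gradient search with the Fletcher-Reeves choice of beta (one common variation) on a small anisotropic quadratic; the crude backtracking line search and the toy loss are illustrative simplifications, not part of the text above.

```python
import numpy as np

def loss_and_grad(w):
    """Toy anisotropic quadratic: f(w) = w0^2 + 10*w1^2."""
    return w[0] ** 2 + 10 * w[1] ** 2, np.array([2 * w[0], 20 * w[1]])

def line_search(w, d, err, lr=1.0, shrink=0.5):
    """Crude backtracking: shrink the step along d until the loss decreases."""
    while loss_and_grad(w + lr * d)[0] >= err and lr > 1e-12:
        lr *= shrink
    return lr

w = np.array([2.0, -3.0])
err, grad = loss_and_grad(w)
d = -grad                                    # first direction: steepest descent
for k in range(50):
    alpha = line_search(w, d, err)           # step length along the search direction
    w = w + alpha * d
    err, new_grad = loss_and_grad(w)
    # Fletcher-Reeves: beta = ||g_new||^2 / ||g_old||^2
    beta = (new_grad @ new_grad) / (grad @ grad + 1e-12)
    d = -new_grad + beta * d                 # next search direction, conjugate to the last
    grad = new_grad

print("final loss:", err)
```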

Application

• Prediction: learning from past experience
• Classification: image processing
• Recognition: pattern recognition
• Data association: e.g., taking the noise out of a telephone signal (signal smoothing)
• Planning
• Data filtering

Uniform Crossover

• A random mask is generated
• The mask determines which bits are copied from one parent and which from the other parent
• Bit density in the mask determines how much material is taken from the other parent (takeover parameter)

Mask:      0110011000 (randomly generated)
Parents:   1010001110   0011010010
Offspring: 0011001010   1010010110
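
The sketch below is one way to implement uniform crossover on bit strings; it reproduces the mask/parents/offspring example above. The mask_ratio parameter is an illustrative name for the bit density of the mask (the takeover parameter).

```python
import random

def uniform_crossover(parent1, parent2, mask=None, mask_ratio=0.5):
    """Return two offspring; a mask bit of 1 copies from parent1, 0 from parent2."""
    if mask is None:
        # Generate a random mask; mask_ratio sets how many bits favour parent1.
        mask = [1 if random.random() < mask_ratio else 0 for _ in parent1]
    child1 = [p1 if m else p2 for m, p1, p2 in zip(mask, parent1, parent2)]
    child2 = [p2 if m else p1 for m, p1, p2 in zip(mask, parent1, parent2)]
    return "".join(child1), "".join(child2)

# Reproduce the example above with the fixed mask 0110011000.
mask = [int(b) for b in "0110011000"]
c1, c2 = uniform_crossover("1010001110", "0011010010", mask)
print(c1, c2)   # 0011001010 1010010110
```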