
Page 1: Neural Network Learning Without Backpropagation

NEURAL NETWORK LEARNING WITHOUT BACKPROPAGATION

Presented by Shujon Naha

Page 2: Neural Network Learning Without Backpropagation

INTRODUCTION

The popular EBP (error backpropagation) algorithm is relatively simple, and it can handle problems with an essentially unlimited number of patterns.

Also, because of its simplicity, it was relatively easy to adapt the EBP algorithm to more efficient neural network architectures where connections across layers are allowed.

However, the EBP algorithm can be up to 1000 times slower than more advanced second-order algorithms.

Page 3: Neural Network Learning Without Backpropagation

INTRODUCTION (CONTD.)

Because EBP is a first-order algorithm, no major speed improvements can be expected from it.

The very efficient second-order Levenberg–Marquardt (LM) algorithm was adopted for neural network training by Hagan and Menhaj. The LM algorithm uses significantly more information about the error surface than just the gradient elements used by the EBP algorithm.

As a consequence, the LM algorithm is not only fast, but it can also train neural networks for which the EBP algorithm has difficulty converging.

Page 4: Neural Network Learning Without Backpropagation

THE LEVENBERG-MARQUARDT ALGORITHM

The Levenberg-Marquardt (LM) algorithm is one of the most widely used optimization algorithms for nonlinear least-squares problems. It outperforms simple gradient descent and conjugate-gradient methods in a wide variety of problems.

Page 5: Neural Network Learning Without Backpropagation

THE LEVENBERG-MARQUARDT ALGORITHM (CONTD.)

The problem for which the LM algorithm provides a solution is called nonlinear least-squares minimization. This implies that the function to be minimized is of the following special form:
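Here r_j denotes the j-th residual (in the neural-network setting, an output error), so the objective is the standard nonlinear least-squares form:

$$ f(\mathbf{x}) \;=\; \tfrac{1}{2}\sum_{j=1}^{m} r_j(\mathbf{x})^2 $$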

Page 6: Neural Network Learning Without Backpropagation
Page 7: Neural Network Learning Without Backpropagation

LM AS A BLEND OF GRADIENT DESCENT AND GAUSS-NEWTON ITERATION

Vanilla gradient descent is the simplest, most intuitive technique for finding minima of a function. Parameter updating is performed by adding the negative of the scaled gradient at each step.
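In symbols, with α denoting the step size (the learning constant), the gradient-descent update is:

$$ \mathbf{x}_{n+1} \;=\; \mathbf{x}_n - \alpha\,\nabla f(\mathbf{x}_n) $$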

Page 8: Neural Network Learning Without Backpropagation

LM AS A BLEND OF GRADIENT DESCENT AND GAUSS-NEWTON ITERATION (CONTD.)

Simple gradient descent suffers from various convergence problems. Logically, we would like to take large steps down the gradient at locations where the gradient is small (the slope is gentle) and conversely, take small steps when the gradient is large, so as not to rattle out of the minima.

Page 9: Neural Network Learning Without Backpropagation

LM AS A BLEND OF GRADIENT DESCENT AND GAUSS-NEWTON ITERATION (CONTD.)

Another issue is that the curvature of the error surface may not be the same in all directions. For example, if there is a long and narrow valley in the error surface, the component of the gradient in the direction that points along the base of the valley is very small while the component along the valley walls is quite large.

This results in motion more in the direction of the walls even though we have to move a long distance along the base and a small distance along the walls.

This situation can be improved upon by using curvature as well as gradient information, namely second derivatives. One way to do this is to use Newton's method to solve the equation ∇f(x) = 0.
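Concretely, Newton's method steps with the inverse Hessian H of f, Gauss-Newton approximates that Hessian by J^T J for least-squares problems, and LM blends the two by damping with μI (a standard formulation, stated here for completeness rather than copied from the slides):

$$ \mathbf{x}_{n+1} = \mathbf{x}_n - \mathbf{H}^{-1}\nabla f(\mathbf{x}_n), \qquad \mathbf{x}_{n+1} = \mathbf{x}_n - \left(\mathbf{J}^{\top}\mathbf{J} + \mu\mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{r}(\mathbf{x}_n) $$

A large μ makes the step behave like (scaled) gradient descent, while a small μ approaches the Gauss-Newton step.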

Page 10: Neural Network Learning Without Backpropagation

DISADVANTAGES OF LM

It cannot be used for problems with many training patterns, because the Jacobian matrix becomes prohibitively large.

The LM algorithm requires the inversion of a quasi-Hessian matrix of size nw × nw in every iteration, where nw is the number of weights. Because of this matrix inversion in every iteration, the speed advantage of the LM algorithm over the EBP algorithm becomes less evident as the network size increases.

The Hagan and Menhaj LM algorithm was developed only for multilayer perceptron (MLP) neural networks. Therefore, much more powerful neural networks, such as fully connected cascade (FCC) or bridged multilayer perceptron architectures, cannot be trained.

Page 11: Neural Network Learning Without Backpropagation

DISADVANTAGES OF LM (CONTD.)

In implementing the LM algorithm, Hagan and Menhaj calculated elements of the Jacobian matrix using basically the same routines as in the EBP algorithm. The difference is that the error backpropagation process (for Jacobian matrix computation) must be carried out not only for every pattern but also for every output separately.

Page 12: Neural Network Learning Without Backpropagation

PROPOSAL

In this paper, the last two limitations of the Hagan and Menhaj LM algorithm (the restriction to MLP topologies and the per-output backpropagation needed for the Jacobian) are addressed, and the proposed method of computation allows it to train networks with arbitrarily connected neurons. This way, feed-forward neural network architectures more complex than the MLP can be efficiently trained. A further advantage of the proposed algorithm is that the learning process requires only forward computation, without the need for backward computations. In many cases, this may also reduce the computation time.

In order to preserve the generalization abilities of neural networks, the size of the networks should be as small as possible. The proposed algorithm partially addresses this problem because it allows training smaller networks with arbitrarily connected neurons.

Page 13: Neural Network Learning Without Backpropagation

DEFINITION OF BASIC CONCEPTS IN NEURAL NETWORK TRAINING

Let us consider neuron j with ni inputs, as shown below. If neuron j is in the first layer, all its inputs are connected to the inputs of the network; otherwise, its inputs can be connected to the outputs of other neurons, or to network inputs if connections across layers are allowed.

Page 14: Neural Network Learning Without Backpropagation

DEFINITION OF BASIC CONCEPTS IN NEURAL NETWORK TRAINING (CONTD.)

Page 15: Neural Network Learning Without Backpropagation

GRADIENT VECTOR AND JACOBIAN MATRIX COMPUTATION

For every pattern, only one backpropagation process is needed in the EBP algorithm, while in second-order algorithms the backpropagation process has to be repeated for every output separately in order to obtain consecutive rows of the Jacobian matrix.

Page 16: Neural Network Learning Without Backpropagation

GRADIENT VECTOR AND JACOBIAN MATRIX COMPUTATION

From the following figure it can be noticed that, for every pattern p, the Jacobian matrix contains a block of no rows, one row per network output, where no is the number of network outputs. The number of columns is equal to the number of weights in the network, and the total number of rows is equal to np × no.
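As an illustration of that structure (the notation e_{p,m} for the error of output m under pattern p, and w_1, ..., w_nw for the weights, is introduced here):

$$
\mathbf{J} \;=\;
\begin{bmatrix}
\dfrac{\partial e_{1,1}}{\partial w_1} & \cdots & \dfrac{\partial e_{1,1}}{\partial w_{n_w}} \\
\vdots & & \vdots \\
\dfrac{\partial e_{1,n_o}}{\partial w_1} & \cdots & \dfrac{\partial e_{1,n_o}}{\partial w_{n_w}} \\
\vdots & & \vdots \\
\dfrac{\partial e_{n_p,n_o}}{\partial w_1} & \cdots & \dfrac{\partial e_{n_p,n_o}}{\partial w_{n_w}}
\end{bmatrix}
\;\in\; \mathbb{R}^{(n_p n_o)\times n_w}
$$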

Page 17: Neural Network Learning Without Backpropagation

GRADIENT VECTOR AND JACOBIAN MATRIX COMPUTATION

In the EBP algorithm, the output errors are part of the δ parameters.

In second-order algorithms, the δ parameters are calculated for each neuron j and each output m separately. Also, in the backpropagation process, the error is replaced by a unit value.

Page 18: Neural Network Learning Without Backpropagation

GRADIENT VECTOR AND JACOBIAN MATRIX COMPUTATION

In the EBP algorithm, elements of the gradient vector are computed directly from the δ parameters, where δj is obtained with the error backpropagation process.

In second-order algorithms, the gradient can instead be obtained from partial results of the Jacobian calculations, where m indicates a network output and δm,j is given by (9).
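A plausible reconstruction of these two relations, with y_{j,i} denoting the i-th input of neuron j and e_m the error at output m (this notation is assumed here, not taken from the slides):

$$ g_{j,i} \;=\; \delta_j\,y_{j,i} \quad\text{(EBP)}, \qquad g_{j,i} \;=\; \sum_{m=1}^{n_o} \delta_{m,j}\,y_{j,i}\,e_m \quad\text{(second order)}, $$

where the second form reuses the Jacobian elements ∂e_m/∂w_{j,i} = δ_{m,j} y_{j,i}.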

Page 19: Neural Network Learning Without Backpropagation

GRADIENT VECTOR AND JACOBIAN MATRIX COMPUTATION

The update rule of the EBP algorithm is a simple gradient step, where n is the index of iterations, w is the weight vector, α is the learning constant, and g is the gradient vector.
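With those symbols, the steepest-descent (EBP) update takes the standard form:

$$ \mathbf{w}_{n+1} \;=\; \mathbf{w}_n - \alpha\,\mathbf{g}_n $$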

Derived from the Newton algorithm and the steepest descent method, the update rule of the LM algorithm combines the two, where μ is the combination coefficient, I is the identity matrix, and J is the Jacobian matrix shown in Fig. 2.
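With e denoting the error vector (notation introduced here; note that g = J^T e), the standard LM update is:

$$ \mathbf{w}_{n+1} \;=\; \mathbf{w}_n - \left(\mathbf{J}_n^{\top}\mathbf{J}_n + \mu\mathbf{I}\right)^{-1}\mathbf{J}_n^{\top}\mathbf{e}_n $$

For μ → 0 this reduces to the Gauss-Newton step; for large μ it approaches the EBP update with a small learning constant.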

Page 20: Neural Network Learning Without Backpropagation

FORWARD ONLY COMPUTATION

The proposed method is designed to improve the efficiency of Jacobian matrix computation by removing the backpropagation process.

The notation δk,j is an extension of (9). It can be interpreted as the signal gain between neurons j and k, and it is given by the expression below,

where k and j are the indices of neurons and Fk,j(yj) is the nonlinear relationship between the output node of neuron k and the output node of neuron j. Naturally, in a feedforward network, k ≥ j. If k = j, then δk,k = sk, where sk is the slope of the activation function.
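A reconstruction of that expression, consistent with the definitions above (net_j denotes the weighted-sum input of neuron j, a notation introduced here; the formula reduces to δk,k = sk when k = j):

$$ \delta_{k,j} \;=\; s_j\,\frac{\partial F_{k,j}(y_j)}{\partial y_j} \;=\; \frac{\partial y_k}{\partial\,\mathrm{net}_j} $$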

Page 21: Neural Network Learning Without Backpropagation

FORWARD ONLY COMPUTATION

The matrix δ has a triangular shape, and its elements can be calculated in the forward-only process. Later, elements of the gradient vector and of the Jacobian matrix can be obtained using (10) and (12), where only the last rows of the matrix δ, associated with the network outputs, are used.

The key issue of the proposed algorithm is the method of calculating the δk,j parameters in the forward computation process.

Page 22: Neural Network Learning Without Backpropagation

FORWARD ONLY COMPUTATION

Page 23: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

Page 24: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

Let us start our analysis with fully connected neural networks (Fig. 5). Any other architecture could be considered as a simplification of fully connected neural networks by eliminating connections (setting weights to zero). If the feedforward principle is enforced (no feedback), fully connected neural networks must have cascade architectures.

Page 25: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

For the first neuron, there is only one δ parameter: δ1,1 = s1.

For the second neuron, there are two δ parameters: δ2,2 = s2 and δ2,1 = s2 w1,2 s1.

The universal formula calculates the δk,j parameters using already computed data for previous neurons (see the reconstruction below), where in a feedforward network neuron j must be located before neuron k, so k ≥ j; δk,k = sk is the slope of the activation function of neuron k; wj,k is the weight between neuron j and neuron k; and δk,j is the signal gain through weight wj,k and through the other part of the network connected to wj,k.
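A reconstruction of that universal formula, consistent with the examples above (δ2,1 = s2 w1,2 s1) and assumed here to correspond to the paper's (22):

$$ \delta_{k,j} \;=\; s_k \sum_{i=j}^{k-1} w_{i,k}\,\delta_{i,j}, \qquad \delta_{j,j} = s_j $$

Each gain for neuron k reuses the gains already computed for earlier neurons i, which is what makes a purely forward, row-by-row computation possible.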

Page 26: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

In order to organize the process, the nn × nn computation table is used for calculating signal gains between neurons, where nn is the number of neurons (Fig. 7).

Natural indices (from 1 to nn) are given to each neuron according to the direction of signal propagation.

For the signal-gain computation, only the connections between neurons need to be considered, while the weights connected to the network inputs and the biasing weights of all neurons are used only at the end of the process.

For a given pattern, a sample of the nn × nn computation table is shown in Fig. 7. The indices of rows and columns are the same as the indices of neurons.

Page 27: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

Page 28: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

The computation table consists of three parts: the weights between neurons in the upper triangle, the vector of slopes of the activation functions on the main diagonal, and the signal-gain matrix δ in the lower triangle.

Only the main-diagonal and lower-triangular elements are computed for each pattern. Initially, the elements on the main diagonal, δk,k = sk, are known as the slopes of the activation functions; the values of the signal gains δk,j are then computed subsequently using (22).

Page 29: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

The computation is processed neuron by neuron starting with the neuron closest to network inputs.

At first, row no. 1 is calculated, and then the elements of the subsequent rows.

Each row below is computed from the elements of the rows above it using (22), as sketched below. After completion of the forward computation process, all elements of the δ matrix, in the form of the lower triangle, are obtained.
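A minimal Python sketch of this row-by-row, forward-only construction of the δ table for an FCC network and a single pattern. It assumes the activation slopes sk from the forward pass and the inter-neuron weights wj,k are already available, and it uses the recursion reconstructed above; this is illustrative code, not the paper's implementation (input and bias weights enter only later, through (10) and (12)):

```python
import numpy as np

def forward_only_deltas(slopes, weights):
    """Build the lower-triangular delta (signal-gain) matrix for one pattern.

    slopes[k]     : slope s_k of neuron k's activation, from the forward pass.
    weights[j, k] : weight w_{j,k} between the output of neuron j and the input
                    of neuron k (j < k); zero where no connection exists.
    Returns delta with delta[k, j] = signal gain between neurons j and k,
    computed row by row via delta_{k,j} = s_k * sum_i w_{i,k} * delta_{i,j}.
    """
    nn = len(slopes)                          # number of neurons
    delta = np.zeros((nn, nn))
    for k in range(nn):
        delta[k, k] = slopes[k]               # main diagonal: activation slopes
        for j in range(k):                    # lower triangle: signal gains
            gain = sum(weights[i, k] * delta[i, j] for i in range(j, k))
            delta[k, j] = slopes[k] * gain
    return delta

# Tiny usage example with three cascaded neurons (hypothetical numbers):
slopes = np.array([0.25, 0.20, 0.10])
weights = np.array([[0.0, 0.5, 0.3],
                    [0.0, 0.0, 0.8],
                    [0.0, 0.0, 0.0]])
print(forward_only_deltas(slopes, weights))
```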

Page 30: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

In the next step, elements of gradient vector and Jacobian matrix are calculated using (10) and (12).

Then, for each pattern, the three rows of the Jacobian matrix (corresponding to the three outputs in this example) are calculated in one step using (10), without additional propagation of δ.

Page 31: Neural Network Learning Without Backpropagation

CALCULATION OF Δ MATRIX FOR FCC ARCHITECTURES

The proposed method gives all the information needed to calculate both the gradient vector (12) and the Jacobian matrix (10) without the backpropagation process; instead, the δ parameters are obtained by a relatively simple forward computation [see (22)].

In order to further simplify the computation process, (22) is completed in two steps.

Page 32: Neural Network Learning Without Backpropagation

ALGORITHM

Page 33: Neural Network Learning Without Backpropagation

COMPARISON OF THE TRADITIONAL AND THE PROPOSED ALGORITHM

Page 34: Neural Network Learning Without Backpropagation

COMPARISON OF THE TRADITIONAL AND THE PROPOSED ALGORITHM

Page 35: Neural Network Learning Without Backpropagation

COMPUTATIONAL SPEED

ASCII Codes to Image Conversion

Page 36: Neural Network Learning Without Backpropagation

COMPUTATIONAL SPEED

Error Correction:

Page 37: Neural Network Learning Without Backpropagation

CONCLUSION

The proposed algorithm can be easily adapted to train arbitrarily connected neural networks, not just MLP topologies. This is very important because neural networks with connections across layers are much more powerful than the commonly used MLP architectures.

It is capable of training neural networks with a reduced number of neurons and, as a consequence, with good generalization abilities.

The proposed method of computation gives the same number of training iterations and success rates as the Hagan and Menhaj implementation of the LM algorithm, since the same Jacobian matrices are obtained by both methods.

Page 38: Neural Network Learning Without Backpropagation

REFERENCES

P. J. Werbos, “Back-propagation: Past and future,” in Proc. IEEE Int. Conf. Neural Netw., vol. 1. San Diego, CA, Jul. 1988, pp. 343–353.

B. M. Wilamowski, N. J. Cotton, O. Kaynak, and G. Dundar, “Computing gradient vector and Jacobian matrix in arbitrarily connected neural networks,” IEEE Trans. Ind. Electron., vol. 55, no. 10, pp. 3784–3790, Oct. 2008.

N. Ampazis and S. J. Perantonis, “Two highly efficient second-order algorithms for training feedforward networks,” IEEE Trans. Neural Netw., vol. 13, no. 5, pp. 1064–1074, Sep. 2002.

C.-T. Kim and J.-J. Lee, “Training two-layered feedforward networks with variable projection method,” IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 371–375, Feb. 2008.

B. M. Wilamowski, “Neural network architectures and learning algorithms: How not to be frustrated with neural networks,” IEEE Ind. Electron. Mag., vol. 3, no. 4, pp. 56–63, Dec. 2009.

S. Ferrari and M. Jensenius, “A constrained optimization approach to preserving prior knowledge during incremental training,” IEEE Trans. Neural Netw., vol. 19, no. 6, pp. 996–1009, Jun. 2008.

Q. Song, J. C. Spall, Y. C. Soh, and J. Ni, “Robust neural network tracking controller using simultaneous perturbation stochastic approximation,” IEEE Trans. Neural Netw., vol. 19, no. 5, pp. 817–835, May 2008.

Page 39: Neural Network Learning Without Backpropagation

REFERENCES

V. V. Phansalkar and P. S. Sastry, “Analysis of the back-propagation algorithm with momentum,” IEEE Trans. Neural Netw., vol. 5, no. 3, pp. 505–506, May 1994.

M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in Proc. Int. Conf. Neural Netw., San Francisco, CA, 1993, pp. 586–591.

A. Toledo, M. Pinzolas, J. J. Ibarrola, and G. Lera, “Improvement of the neighborhood based Levenberg–Marquardt algorithm by local adaptation of the learning coefficient,” IEEE Trans. Neural Netw., vol. 16, no. 4, pp. 988–992, Jul. 2005.

J.-M. Wu, “Multilayer Potts perceptrons with Levenberg–Marquardt learning,” IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2032–2043, Dec. 2008.

M. T. Hagan and M. B. Menhaj, “Training feedforward networks with the Marquardt algorithm,” IEEE Trans. Neural Netw., vol. 5, no. 6, pp. 989–993, Nov. 1994.

H. B. Demuth and M. Beale, Neural Network Toolbox: For Use with MATLAB. Natick, MA: Mathworks, 2000.

M. E. Hohil, D. Liu, and S. H. Smith, “Solving the N-bit parity problem using neural networks,” Neural Netw., vol. 12, no. 9, pp. 1321–1323, Nov. 1999.

B. M. Wilamowski, D. Hunter, and A. Malinowski, “Solving parity-N problems with feedforward neural networks,” in Proc. IEEE IJCNN, Piscataway, NJ: IEEE Press, 2003, pp. 2546–2551.

B. M. Wilamowski and H. Yu, “Improved computation for Levenberg–Marquardt training,” IEEE Trans. Neural Netw., vol. 21, no. 6, pp. 930–937, Jun. 2010.