Efficient BP Algorithms for General Feedforward Neural Networks*
S. España Boquera, M.J. Castro Bleda, F. Zamora Martínez, J. Gorbe Moya
Dep. Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain
18-21 Jun 2007, Murcia, Spain
* Work partially supported under contracts TIN2006-12767 and GVA06/302.
España, Castro, Zamora, Gorbe (DSIC), IWINAC 2007, May 2007
Index
1 Introduction and motivation
2 Preprocessing the ANN: The Consecutive Retrieval Problem
3 Tests of efficiency
  Efficiency with general feedforward topologies
4 Additional features of the BP implementation
  Data description and manipulation facilities
  Example of use of the application
5 Conclusions and future work
Introduction and motivation
The BackPropagation (BP) algorithm is one of the most widely used supervised learning techniques to train feedforward Artificial Neural Networks (ANNs). There are many variations of the BP algorithm. What is usually required:
The ability to specify general topologies.
Good data description facilities: it is not always practical or even possible to provide a set of input and output pairs.
A fast implementation: since training requires a lot of CPU time, a relative improvement (speed-up) makes a difference.
Introduction and motivation
Specify general topologies. Example: n-gram language model estimation.
Introduction and motivation
Good data description facilities. Example: neural convolution kernel filter.
Introduction and motivation
Fast implementation
BP algorithm implementations: a tradeoff between efficiency and flexibility:
Specialized topologies (e.g., layered): connection weights stored in matrices; better data locality and simplified data access.
General feedforward topologies: list of neurons in topological order; increased flexibility at the expense of data locality.
→ Our main aim was to obtain a BP implementation which is as fast as specialized BP algorithms while supporting arbitrary feedforward topologies.
Introduction and motivation
In this work, we present an efficient implementation of the BP algorithm to train general feedforward ANNs with the following features:
Incremental mode → faster training than batch mode.
Momentum: adding a momentum term allows a network to respond not only to the local gradient but also to recent trends in the error surface.
Weight decay: effect similar to a pruning algorithm.
Softmax, sigmoid/logistic, tanh, linear, ... activation functions.
Tied and constant weights.
This BP implementation is written in C++ and is part of a toolkit named April (A Pattern Recognizer in Lua), which also provides many data description facilities and other functionalities (HMMs, DTW, clustering, etc.).
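The momentum and weight-decay variants listed above can be sketched as follows. This is an illustrative Python fragment with assumed hyper-parameter names and values, not the April C++ code:

```python
import numpy as np

def update_weights(w, grad, velocity, lr=0.2, momentum=0.2, decay=1e-4):
    """One incremental BP step: the momentum term smooths the update
    over recent gradients, and weight decay shrinks weights toward
    zero (an effect similar to pruning)."""
    velocity = momentum * velocity - lr * grad - lr * decay * w
    return w + velocity, velocity

# Toy usage: a single update starting from zero weights and velocity.
w = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
w, v = update_weights(w, grad, v)
```

With zero initial weights the decay term vanishes and the step reduces to plain gradient descent; the momentum term only starts to matter from the second update on.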
Preprocessing the ANN
The bottleneck in the simulation of a big ANN is the dot product of the inputs x and the weights w of each neuron in the forward pass, and the traversal of connections in backpropagation. Solution: preprocessing of the network topology.
Given a neuron, the activation values of its predecessors form a consecutive subvector which can be efficiently traversed. Therefore, the product x · w of each neuron, as well as the backpropagation of the error, are improved (cheaper iteration and better locality). This property cannot be guaranteed for general topologies → some neuron values may need to be duplicated. The operations for feedforward and backpropagation are "compiled" into a sequence of actions to be performed.
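A minimal sketch of why the contiguous layout pays off (our own illustration in Python, not April's code; all names are assumptions): after preprocessing, the forward pass of each neuron reduces to a dot product over one consecutive slice of the activation vector, driven by precompiled actions:

```python
import numpy as np

def forward(acts, actions):
    """Each "compiled" action reads one consecutive slice of the
    activation vector: (first_index, fan_in, weights, bias, out_index).
    The contiguous slice gives cheap iteration and good cache locality."""
    for first, n, w, b, out in actions:
        acts[out] = np.tanh(np.dot(acts[first:first + n], w) + b)
    return acts

# Toy network: two inputs feed one tanh neuron stored at index 2.
acts = np.array([1.0, 2.0, 0.0])
actions = [(0, 2, np.array([0.5, -0.5]), 0.0, 2)]
forward(acts, actions)
# acts[2] now holds tanh(1.0*0.5 + 2.0*(-0.5)) = tanh(-0.5)
```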
→ Consecutive retrieval problem.
Preprocessing the ANN
Consecutive retrieval problem: let X be a set with P subsets C1, C2, ..., CP. The goal is to obtain a sequence A = a1, ..., ak of elements of X so that every Ci appears in A as a contiguous subsequence, while keeping the length of A as small as possible.
It is proven to be an NP-complete problem. A greedy algorithm has been developed which achieves packing rates far superior to a "naive" consecutive arrangement.
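The slides do not give the greedy procedure itself; the following Python fragment is one plausible greedy strategy for the problem as defined (repeatedly appending the remaining subset that overlaps the current tail of the sequence the most, duplicating elements only when no overlap is possible), purely as an illustration and not the authors' algorithm:

```python
def greedy_consecutive(sets):
    """Greedy sketch: build a sequence A so that every input set
    appears as a contiguous block, reusing the tail of A when it is
    a subset of the next chosen set."""
    A = []
    remaining = [list(s) for s in sets]
    while remaining:
        best, best_ov = None, -1
        for s in remaining:
            # Longest suffix of A whose elements all belong to s.
            ov = 0
            for k in range(min(len(A), len(s)), 0, -1):
                if set(A[-k:]) <= set(s):
                    ov = k
                    break
            if ov > best_ov:
                best, best_ov = s, ov
        tail = set(A[-best_ov:]) if best_ov > 0 else set()
        A.extend(x for x in best if x not in tail)
        remaining.remove(best)
    return A

# {1,2}, {2,3}, {3,4} pack into 4 elements instead of the naive 6.
A = greedy_consecutive([[1, 2], [2, 3], [3, 4]])
```

Each chosen set ends up occupying the final |Ci| positions of A at the moment it is placed, so the contiguity constraint holds by construction; only the total length is approximated.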
Tests of efficiency
Comparison with the Stuttgart Neural Network Simulator (SNNS).
Handwritten digit classification (16×16 B/W pixels).
The corpus consists of 1 000 digits and is stored in a single image in PNG format.
256 input neurons (the size of an image), 10 output neurons.
Experiment set-up
Layered feedforward networks with one and two hidden layers containing between 10 and 200 neurons.
Sigmoid activation functions. 10 training epochs.
AMD Athlon (1 333 MHz) with 384 MB of RAM under Linux.
Analysis of the efficiency results
[Figure: temporal cost per epoch (sec./epoch) of April and SNNS as a function of the number of weights W.]
Analysis of the efficiency results
[Figure: ratio between the temporal cost per epoch of SNNS and that of April as a function of the number of weights W.]
Analysis of the efficiency results
The analysis tool valgrind has been used to analyse the number of cache misses when training networks with different numbers of weights W.
                   April                          SNNS
     W    # Accesses   L1       L1&L2    # Accesses   L1       L1&L2
                       misses   misses                misses   misses
 2 790    3.10×10^8    0.14%    0.10%    4.45×10^8    4.46%    0.05%
25 120    1.70×10^9    1.71%    0.03%    2.43×10^9    7.39%    4.81%
62 710    4.25×10^9    1.85%    1.11%    6.00×10^9    7.18%    5.63%
"L1 misses" is the percentage of data accesses which result in L1 cache misses; "L1&L2 misses" shows the percentage of data accesses which miss at both the L1 (fastest) and L2 (slightly slower) caches, resulting in a slow access to main memory.
Efficiency with general feedforward topologies
Figure: feedforward network with shortcuts between the layers: a neuron is connected to neurons of all the previous layers.
Efficiency with general feedforward topologies
Figure: feedforward network with segmented input: the whole 16×16 image is divided into four 8×8 fragments forming four groups of hidden neurons.
Efficiency with general feedforward topologies
Other topologies tested in this work: each layer connected to all previous layers; segmented 16×16 input in four 8×8 pixel fragments.
Training these topologies with April and SNNS gives time results analogous to those above. April is able to train general feedforward topologies as efficiently as specific feedforward topologies.
Additional Features of the BP implementation
Softmax activation function: numerical instability problems are handled, with the possibility of grouping output neurons so that the softmax calculations are performed independently in each group.
Tied and constant weights.
Weight decay.
Fixed-point version of feedforward for embedded systems.
Reproducibility of experiments.
Ability to stop and resume experiments, useful for process migration and grid computing.
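As an illustration of the grouped softmax mentioned above (a Python sketch with our own naming, not the April API), the usual max-subtraction trick avoids exp() overflow, and grouping turns each group of output neurons into its own probability distribution:

```python
import numpy as np

def grouped_softmax(x, groups):
    """Softmax computed independently over each group of output
    neurons; subtracting the per-group maximum before exp() keeps
    the computation numerically stable."""
    out = np.empty_like(x, dtype=float)
    for idx in groups:
        z = x[idx] - np.max(x[idx])  # shift so the largest entry is 0
        e = np.exp(z)
        out[idx] = e / e.sum()
    return out

# Two independent groups; the large inputs would overflow a naive exp().
y = grouped_softmax(np.array([1.0, 2.0, 1000.0, 1000.0]),
                    [[0, 1], [2, 3]])
# y[0:2] and y[2:4] each sum to 1; y[2] == y[3] == 0.5
```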
Data description and manipulation facilities
Lua is an extensible procedural embedded programming language especially designed for extending and customizing applications, with powerful data description facilities. Besides the Lua description facilities, April adds the matrix and dataset classes, which allow the definition and manipulation of possibly huge sets of samples in a way easier and more flexible than simply enumerating the pairs of inputs and outputs.
Iterators over n-dimensional matrices.
Combination of datasets: slicing, union, indexing, etc.
Example
First, the image is loaded into a matrix, and later a dataset containing the 10×100 samples of 16×16 pixel values is generated from it.
samples = matrix.loadImage("digits.png")
input_data = dataset.matrix(samples, {
    patternSize = {16,16},  -- sample size
    offset      = {0,0},    -- initial window position
    numSteps    = {100,10}, -- #steps in each direction
    stepSize    = {16,16},  -- step size
    orderStep   = {1,0}     -- step direction
})
Example
Later, the matrix [1 0 0 0 0 0 0 0 0 0] is iterated circularly in order to obtain the dataset with the associated desired outputs.
m2 = matrix(10, {1,0,0,0,0,0,0,0,0,0})
output_data = dataset.matrix(m2, {
    patternSize = {10},
    offset      = {0},
    numSteps    = {input_data:numPatterns()},
    circular    = {true}, -- circular dataset
    stepSize    = {-1}
})
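To see why this works (an illustrative Python sketch of the mechanism, not April code): sliding a window of size 10 circularly with step −1 over the vector [1 0 ... 0] yields one one-hot pattern per step, with the 1 advancing one position each time, so the desired outputs cycle through the ten classes:

```python
def circular_one_hot(base, num_patterns, step=-1):
    """Slide a window of len(base) over base circularly with the
    given step, producing one pattern per step; with a one-hot base
    and step -1, pattern i is one-hot at position i mod len(base)."""
    n = len(base)
    out = []
    for i in range(num_patterns):
        off = (i * step) % n  # circular window offset
        out.append([base[(off + j) % n] for j in range(n)])
    return out

rows = circular_one_hot([1,0,0,0,0,0,0,0,0,0], 20)
# rows[0] has its 1 at position 0, rows[1] at position 1, ...,
# and rows[10] wraps around to position 0 again
```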
Example
The corresponding training, validation and test input and output datasets are obtained by slicing the former datasets.
train_input       = dataset.slice(input_data ,   1,  600)
train_output      = dataset.slice(output_data,   1,  600)
validation_input  = dataset.slice(input_data , 601,  800)
validation_output = dataset.slice(output_data, 601,  800)
test_input        = dataset.slice(input_data , 801, 1000)
test_output       = dataset.slice(output_data, 801, 1000)
Example
Although more complex ANNs can be described, for layered ANNs it is possible to give a simple description like "256 inputs 30 logistic 10 linear".
rnd = random(1234) -- pseudo-random generator
the_net = mlp.generate("256 inputs " ..
                       "30 logistic " ..
                       "10 linear", rnd, -0.7, 0.7)
Example
for i=1,100 do
    mse_train = the_net:train {
        learning_rate  = 0.2,
        momentum       = 0.2,
        input_dataset  = train_input,
        output_dataset = train_output,
        shuffle        = rnd
    }
    mse_val = the_net:validate {
        input_dataset  = validation_input,
        output_dataset = validation_output
    }
    printf("Cycle %3d MSE %f %f\n", i, mse_train, mse_val)
end
mse_test = the_net:validate {
    input_dataset  = test_input,
    output_dataset = test_output
}
printf("MSE of the test set: %f\n", mse_test)
Conclusions
The April toolkit is up to 16 times faster than SNNS. In addition, its capacity to train general feedforward networks does not decrease its efficiency, thanks to the use of data structures with great memory locality instead of linked lists (as in SNNS).
An approximation algorithm for the NP-complete consecutive retrieval problem has been designed in order to maintain this efficiency.
Future work
Nearly finished:
SSE implementation → Streaming SIMD Extensions for the Intel Pentium architecture.
Considered extensions:
GPU implementation → general-purpose GPU computing.
Recurrent networks: this type of network has proved very useful in diverse fields, such as Natural Language Processing.
Graphical interface: adding a graphical interface could orient the application towards didactic use.