Efficient BP Algorithms for General Feedforward Neural Networks*
S. España Boquera, M.J. Castro Bleda, F. Zamora Martínez, J. Gorbe Moya
Dep. Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Spain
18-21 Jun 2007, Murcia, Spain
* Work partially supported under contracts TIN2006-12767 and GVA06/302.
España, Castro, Zamora, Gorbe (DSIC), IWINAC 2007, May 2007
Index
1 Introduction and motivation
2 Preprocessing the ANN: The Consecutive Retrieval Problem
3 Tests of efficiency
  Efficiency with general feedforward topologies
4 Additional features of the BP implementation
  Data description and manipulation facilities
  Example of use of the application
5 Conclusions and future work
Introduction and motivation
The BackPropagation (BP) algorithm is one of the most widely used supervised learning techniques to train feedforward Artificial Neural Networks (ANNs). There are many variations of the BP algorithm. What is usually required:
The ability to specify general topologies.
Good data description facilities: it is not always practical or even possible to provide a set of input and output pairs.
A fast implementation: since training requires a lot of CPU time, a relative improvement (speed-up) makes a difference.
Introduction and motivation
Specify general topologies. Example: n-gram language model estimation.
Introduction and motivation
Good data description facilities. Example: neural convolution kernel filter.
Introduction and motivation
Fast implementation
BP algorithm implementations: a tradeoff between efficiency and flexibility:
Specialized topologies (e.g., layered): connection weights stored in matrices; better data locality and simplified data access.
General feedforward topologies: list of neurons in topological order; increased flexibility at the expense of data locality.
→ Our main aim was to obtain a BP implementation which is as fast as specialized BP algorithms while supporting arbitrary feedforward topologies.
Introduction and motivation
In this work, we present an efficient implementation of the BP algorithm to train general feedforward ANNs with the following features:
Incremental mode → faster training than batch mode.
Momentum: adding a momentum term allows a network to respond not only to the local gradient but also to recent trends in the error surface.
Weight decay: effect similar to a pruning algorithm.
Softmax, sigmoid/logistic, tanh, linear, ... activation functions.
Tied and constant weights.
This BP implementation is written in C++ and is part of a toolkit named April (A Pattern Recognizer in Lua), which also provides many data description facilities and other functionalities (HMMs, DTW, clustering, etc.).
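The momentum and weight-decay variants listed above can be sketched as follows. This is an illustrative Python fragment with assumed hyper-parameter names and values, not the April C++ code:

```python
import numpy as np

def update_weights(w, grad, velocity, lr=0.2, momentum=0.2, decay=1e-4):
    """One incremental BP step: the momentum term smooths the update
    over recent gradients, and weight decay shrinks weights toward
    zero (an effect similar to pruning)."""
    velocity = momentum * velocity - lr * grad - lr * decay * w
    return w + velocity, velocity

# Toy usage: a single update starting from zero weights and velocity.
w = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
w, v = update_weights(w, grad, v)
```

With zero initial weights the decay term vanishes and the step reduces to plain gradient descent; the momentum term only starts to matter from the second update on.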
Preprocessing the ANN
The bottleneck in the simulation of a big ANN is the dot product of the inputs x and the weights w of each neuron in the forward pass, and the traversal of connections in backpropagation. Solution: preprocessing of the network topology.
Given a neuron, the activation values of its predecessors form a consecutive subvector which can be efficiently traversed. Therefore, the product x · w of each neuron, as well as the backpropagation of the error, are improved (cheaper iteration and better locality). This property cannot be guaranteed for general topologies → some neuron values may need to be duplicated. The operations for feedforward and backpropagation are "compiled" into a sequence of actions to be performed.
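A minimal sketch of why the contiguous layout pays off (our own illustration in Python, not April's code; all names are assumptions): after preprocessing, the forward pass of each neuron reduces to a dot product over one consecutive slice of the activation vector, driven by precompiled actions:

```python
import numpy as np

def forward(acts, actions):
    """Each "compiled" action reads one consecutive slice of the
    activation vector: (first_index, fan_in, weights, bias, out_index).
    The contiguous slice gives cheap iteration and good cache locality."""
    for first, n, w, b, out in actions:
        acts[out] = np.tanh(np.dot(acts[first:first + n], w) + b)
    return acts

# Toy network: two inputs feed one tanh neuron stored at index 2.
acts = np.array([1.0, 2.0, 0.0])
actions = [(0, 2, np.array([0.5, -0.5]), 0.0, 2)]
forward(acts, actions)
# acts[2] now holds tanh(1.0*0.5 + 2.0*(-0.5)) = tanh(-0.5)
```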
→ Consecutive retrieval problem.
Preprocessing the ANN
Consecutive retrieval problem: let X be a set with P subsets C1, C2, ..., CP. The goal is to obtain a sequence A = a1, ..., ak of elements of X so that every Ci appears in A as a contiguous subsequence, while keeping the length of A as small as possible.
It is proven to be an NP-complete problem. A greedy algorithm has been developed which achieves packing rates far superior to a "naive" consecutive arrangement.
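The slides do not give the greedy procedure itself; the following Python fragment is one plausible greedy strategy for the problem as defined (repeatedly appending the remaining subset that overlaps the current tail of the sequence the most, duplicating elements only when no overlap is possible), purely as an illustration and not the authors' algorithm:

```python
def greedy_consecutive(sets):
    """Greedy sketch: build a sequence A so that every input set
    appears as a contiguous block, reusing the tail of A when it is
    a subset of the next chosen set."""
    A = []
    remaining = [list(s) for s in sets]
    while remaining:
        best, best_ov = None, -1
        for s in remaining:
            # Longest suffix of A whose elements all belong to s.
            ov = 0
            for k in range(min(len(A), len(s)), 0, -1):
                if set(A[-k:]) <= set(s):
                    ov = k
                    break
            if ov > best_ov:
                best, best_ov = s, ov
        tail = set(A[-best_ov:]) if best_ov > 0 else set()
        A.extend(x for x in best if x not in tail)
        remaining.remove(best)
    return A

# {1,2}, {2,3}, {3,4} pack into 4 elements instead of the naive 6.
A = greedy_consecutive([[1, 2], [2, 3], [3, 4]])
```

Each chosen set ends up occupying the final |Ci| positions of A at the moment it is placed, so the contiguity constraint holds by construction; only the total length is approximated.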
Tests of efficiency
Comparison with the Stuttgart Neural Network Simulator (SNNS).
Handwritten digit classification (16×16 B/W pixels).
The corpus consists of 1 000 digits and is stored in a single image in PNG format.
256 input neurons (the size of an image), 10 output neurons.
Experiment set-up
Layered feedforward networks with one and two hidden layers containing between 10 and 200 neurons.
Sigmoid activation functions. 10 training epochs.
AMD Athlon (1 333 MHz) with 384 MB of RAM under Linux.
Analysis of the efficiency results
[Figure: temporal cost per epoch (sec./epoch) of April and SNNS as a function of the number of weights W.]
Analysis of the efficiency results
[Figure: ratio between the temporal cost per epoch of SNNS and that of April as a function of the number of weights W.]
Analysis of the efficiency results
The analysis tool valgrind has been used to analyse the number of cache misses when training networks with different numbers of weights W.
                   April                          SNNS
     W    # Accesses   L1       L1&L2    # Accesses   L1       L1&L2
                       misses   misses                misses   misses
 2 790    3.10×10^8    0.14%    0.10%    4.45×10^8    4.46%    0.05%
25 120    1.70×10^9    1.71%    0.03%    2.43×10^9    7.39%    4.81%
62 710    4.25×10^9    1.85%    1.11%    6.00×10^9    7.18%    5.63%
"L1 misses" is the percentage of data accesses which result in L1 cache misses; "L1&L2 misses" shows the percentage of data accesses which miss at both the L1 (fastest) and L2 (slightly slower) caches, resulting in a slow access to main memory.
Efficiency with general feedforward topologies
Figure: feedforward network with shortcuts between the layers: a neuron is connected to neurons of all the previous layers.
Efficiency with general feedforward topologies
Figure: feedforward network with segmented input: the whole 16×16 image is divided into four 8×8 fragments forming four groups of hidden neurons.
Efficiency with general feedforward topologies
Other topologies tested in this work: each layer connected to all previous layers; segmented 16×16 input in four 8×8 pixel fragments.
Training these topologies with April and SNNS gives time results analogous to those above. April is able to train general feedforward topologies as efficiently as specific feedforward topologies.
Additional Features of the BP implementation
Softmax activation function: numerical instability problems are handled, with the possibility of grouping output neurons so that the softmax calculations are performed independently in each group.
Tied and constant weights.
Weight decay.
Fixed-point version of feedforward for embedded systems.
Reproducibility of experiments.
Ability to stop and resume experiments, useful for process migration and grid computing.
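As an illustration of the grouped softmax mentioned above (a Python sketch with our own naming, not the April API), the usual max-subtraction trick avoids exp() overflow, and grouping turns each group of output neurons into its own probability distribution:

```python
import numpy as np

def grouped_softmax(x, groups):
    """Softmax computed independently over each group of output
    neurons; subtracting the per-group maximum before exp() keeps
    the computation numerically stable."""
    out = np.empty_like(x, dtype=float)
    for idx in groups:
        z = x[idx] - np.max(x[idx])  # shift so the largest entry is 0
        e = np.exp(z)
        out[idx] = e / e.sum()
    return out

# Two independent groups; the large inputs would overflow a naive exp().
y = grouped_softmax(np.array([1.0, 2.0, 1000.0, 1000.0]),
                    [[0, 1], [2, 3]])
# y[0:2] and y[2:4] each sum to 1; y[2] == y[3] == 0.5
```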
Data description and manipulation facilities
Lua is an extensible procedural embedded programming language especially designed for extending and customizing applications, with powerful data description facilities. Besides the Lua description facilities, April adds the matrix and dataset classes, which allow the definition and manipulation of possibly huge sets of samples in a way easier and more flexible than simply enumerating the pairs of inputs and outputs.
Iterators over n-dimensional matrices.
Combination of datasets: slicing, union, indexing, etc.
Example
First, the image is loaded into a matrix, and later a dataset containing the 10×100 samples of 16×16 pixel values is generated from it.
samples = matrix.loadImage("digits.png")
input_data = dataset.matrix(samples, {
    patternSize = {16,16},  -- sample size
    offset      = {0,0},    -- initial window position
    numSteps    = {100,10}, -- #steps in each direction
    stepSize    = {16,16},  -- step size
    orderStep   = {1,0}     -- step direction
})
Example
Later, the matrix [1 0 0 0 0 0 0 0 0 0] is iterated circularly in order to obtain the dataset with the associated desired outputs.
m2 = matrix(10, {1,0,0,0,0,0,0,0,0,0})
output_data = dataset.matrix(m2, {
    patternSize = {10},
    offset      = {0},
    numSteps    = {input_data:numPatterns()},
    circular    = {true}, -- circular dataset
    stepSize    = {-1}
})
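To see why this works (an illustrative Python sketch of the mechanism, not April code): sliding a window of size 10 circularly with step −1 over the vector [1 0 ... 0] yields one one-hot pattern per step, with the 1 advancing one position each time, so the desired outputs cycle through the ten classes:

```python
def circular_one_hot(base, num_patterns, step=-1):
    """Slide a window of len(base) over base circularly with the
    given step, producing one pattern per step; with a one-hot base
    and step -1, pattern i is one-hot at position i mod len(base)."""
    n = len(base)
    out = []
    for i in range(num_patterns):
        off = (i * step) % n  # circular window offset
        out.append([base[(off + j) % n] for j in range(n)])
    return out

rows = circular_one_hot([1,0,0,0,0,0,0,0,0,0], 20)
# rows[0] has its 1 at position 0, rows[1] at position 1, ...,
# and rows[10] wraps around to position 0 again
```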
Example
The corresponding training, validation and test input and output datasets are obtained by slicing the former datasets.
train_input       = dataset.slice(input_data ,   1,  600)
train_output      = dataset.slice(output_data,   1,  600)
validation_input  = dataset.slice(input_data , 601,  800)
validation_output = dataset.slice(output_data, 601,  800)
test_input        = dataset.slice(input_data , 801, 1000)
test_output       = dataset.slice(output_data, 801, 1000)
Example
Although more complex ANNs can be described, for layered ANNs it is possible to give a simple description like "256 inputs 30 logistic 10 linear".
rnd = random(1234) -- pseudo-random generator
the_net = mlp.generate("256 inputs " ..
                       "30 logistic " ..
                       "10 linear", rnd, -0.7, 0.7)
Example
for i=1,100 do
    mse_train = the_net:train {
        learning_rate  = 0.2,
        momentum       = 0.2,
        input_dataset  = train_input,
        output_dataset = train_output,
        shuffle        = rnd
    }
    mse_val = the_net:validate {
        input_dataset  = validation_input,
        output_dataset = validation_output
    }
    printf("Cycle %3d MSE %f %f\n", i, mse_train, mse_val)
end
mse_test = the_net:validate {
    input_dataset  = test_input,
    output_dataset = test_output
}
printf("MSE of the test set: %f\n", mse_test)
Conclusions
The April toolkit is up to 16 times faster than SNNS. In addition, its capacity to train general feedforward networks does not decrease its efficiency, thanks to the use of data structures with great memory locality instead of linked lists (as in SNNS).
An approximation algorithm for the NP-complete consecutive retrieval problem has been designed in order to maintain this efficiency.
Future work
Nearly finished:
SSE implementation → Streaming SIMD Extensions for the Intel Pentium architecture.
Considered extensions:
GPU implementation → general-purpose GPU computing.
Recurrent networks: this type of network has proved very useful in diverse fields, such as Natural Language Processing.
Graphical interface: adding a graphical interface could orient the application towards didactic use.