
A Framework for Multiprocessor Neural Networks Systems

Mumtazimah Mohamad, Md Yazid Mohamad Saman and Muhammad Suzuri Hitam
Department of Computer Science, Faculty of Science & Technology

University Malaysia Terengganu Kuala Terengganu, Terengganu, Malaysia

[email protected], {yazid, [email protected]}

Abstract—Artificial neural networks (ANN) are able to simplify classification tasks and have been steadily improving in both accuracy and efficiency. However, several issues need to be addressed when constructing an ANN for handling different scales of data, especially those with a low accuracy score. Parallelism is considered a practical solution for handling a large workload. However, a comprehensive understanding is needed to build a scalable neural network that achieves the optimal training time for a large network. Therefore, this paper proposes several strategies, including neural ensemble techniques and a parallel architecture, for distributing data to several network processor structures in order to reduce the time required for recognition tasks without compromising the achieved accuracy. The initial results indicate that the proposed strategies improve the speedup of large-scale neural networks while maintaining an acceptable accuracy.

Keywords—Artificial neural networks; back propagation; ensemble; multiprocessor; parallel.

I. INTRODUCTION

Pattern recognition involves the sequential modelling of the nature of patterns; the extraction of pattern features; and the association, storage, classification and clustering of those features. Artificial Neural Networks (ANN) have greatly assisted recognition and classification work in various fields for decades. ANNs can be applied to many tasks such as function approximation, regression analysis, classification or recognition, data processing, control applications and robotics. The biggest contribution of ANNs is their ability to serve as an arbitrary function approximation mechanism that learns from observed data [1]. However, the use of ANNs is not straightforward, so an understanding of the basic theory is important. Problem solving using an ANN should be based on the diversity of the data used as input in order to improve the learning ability for recognition and classification tasks.

A stand-alone ANN is inefficient because its computation is time consuming and its output is not unique [2]. Previous research shows that the combination of neural network components appears to be a natural extension of the neural modelling paradigm [1] and is an effective method [3] for improving ANN computation. Owing to these advantages, a deeper understanding of combination methods for ANNs is useful, and it leads to the point of interest of this paper: a scalable multiple-network strategy. This study aims to reduce the computation time and improve the ability of an ANN by using a combination of multiple parallel networks.

II. RELATED WORKS

A soft network architecture, rather than a special hardware implementation, is suitable for reducing the time required for machine learning in classification tasks. ANN learning methods are extensively used for pattern recognition and classification tasks and have been found to improve both accuracy and performance. Back propagation (BP) is a popular incremental learning scheme for ANNs, shown by Rumelhart (1986) to be the most universal approximation technique [4]. The motivations for using BP are its simplicity, hardware availability, reliability [5] and its versatility as a “universal approximator” [1]. A BP scheme is a type of generalized delta rule for nonlinear functions [6] that adapts to the data by iteratively adjusting the weights until the error signal reaches its target. The greater the number of patterns, the better the trained net will be. However, if too few epochs are used, the network does not learn, while too many epochs can lead to over-training or over-adjustment.

Many excellent results have proved the advantages of the BP scheme. However, the scheme can be further improved for managing large datasets, usually used for identification purposes, without loss of accuracy. Online BP may require thousands of epochs with a large multilayer perceptron (MLP) network and consume up to months of processing time [7, 8]. The ability of the BP scheme to manage larger datasets can therefore be questioned, since such datasets need more training and, as a result, more processing time. To handle the large-data case, the use of multiple processors is therefore proposed in order to distribute the ANN computation while maintaining the ANN's connectionism principle.

An important issue in a parallel ANN task is to ensure that the multiple processors running the classical BP scheme can be approximated to perform as a single processor [9]. The use of multiple processors for an ANN may help to reduce the training time compared with sequential training [10, 11]. Using multiple processors also decreases the computation of an ANN by distributing it over a set of computers [8, 12, 13]. Most previous research has highlighted the success of simultaneous training tasks but has not shown a clear connection between the accuracy obtained and the scalability of the ANN.

Parallel ANNs create a trade-off between the achieved accuracy and the required computation time. The divide-and-conquer method implemented in parallel computing reduces the accuracy level as the number of processors grows. In cases that involve learning from large data, the use of a single classifier is not suitable, because it takes a long time to process and may require special hardware. It is believed that a better learning strategy will lead to an increased level of performance.

The ensemble technique is proposed in order to preserve the capability of the BP scheme. Kittler et al. (1998), as discussed in [3], recognized the need for a theoretical framework to describe combinations of classifiers (or networks), namely the parallel combination of classifiers as used in an ensemble. A single ANN is an unstable learner: it is sensitive to the initial conditions and performs differently depending on the training data used. Furthermore, the architecture of an ANN itself is determined by trial and error and is not unique. Thus, integrating different neural networks using an ensemble strategy is an elegant and effective way to handle the variability of the networks' outputs [14], and it is rather easy to implement [2]. The idea of ANN classifier combination is to form an integrated ANN classifier system whose classification performance is higher than that of a single ANN. This neural modelling paradigm consists of neurons, layers and networks [26], with additional architectures similar to the specialization of functions in the brain [20]. The motivation of this study is to formulate a scalable learning approach that can accurately recognize patterns and to compare its performance with a single network and other multi-network approaches. Additionally, most research in this field focuses on solutions based on extracted features rather than on the original image pixels.

III. MULTI-PROCESSOR ANNS

The underlying idea of a multiple neural network system is the utilization of the valuable information held by the individual neural networks. Several requirements must be met by both the ensemble members and the ensemble strategy for the system to achieve high classification performance. Each individual network should have enough training data, and the members of the ensemble must form a complementary set of classifiers [20]. The number of training iterations for each pattern and the network size are the two important factors in measuring the performance of neural networks [2]. The proposed approach is a promising design for the structure of an error-independent ensemble ANN that uses a set of parallel processors.

There are five phases in the learning framework proposed for the parallel classification process. As shown in Fig. 1, the input data samples that represent the large dataset are divided into data subsets according to the number of processors. Next, each data subset is assigned to an individual ANN on a different processor for simultaneous training. Finally, the classification results of the individual ANNs are integrated using an aggregate ensemble strategy. The techniques used in the framework of Fig. 1 are discussed in the following subsections.

Figure 1. Overall framework of the ensemble multiprocessor system: an input module distributes the input x = {x1, x2, ..., xn} to the individual networks ANN1, ANN2, ..., ANNk, and a combination module merges their outputs Y1(x1), Y2(x2), ..., Yk(xk) into the ensemble output Y(x).

A. Partitioning the Original Dataset

The dataset partitioning process divides the original dataset DS into training datasets Tr1, Tr2, ..., Trn using the bagging algorithm in order to ensure sufficient training data. The bagging sampling algorithm, adapted from [3] and shown in Fig. 2, ensures that different samples with different training data subsets are selected [15]. Bagging is widely used for data sampling with replacement and efficiently constructs training data of a reasonable size.

Input: original dataset DS with P rows
Output: the new training subsets (Tr1, Tr2, ..., Trn)
for t = 1 to n
    for i = 1 to N
        RandRow = P * rand()
        if RandRow <= P then
            Trt(i, all columns) = DS(RandRow, all columns)
        end if
    next i
next t
Output the final training subsets (Tr1, Tr2, ..., Trn)

Figure 2. Bagging Algorithm
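For concreteness, a minimal Python sketch of this sampling step is given below. It is an interpretation of Fig. 2, not the authors' code; the function name, the fixed seed and the toy dataset are illustrative choices.

```python
import numpy as np

def bagging_subsets(DS, n_subsets, subset_size=None):
    """Draw n_subsets bootstrap samples (rows sampled with replacement) from DS."""
    P = DS.shape[0]                          # number of rows in the original dataset
    size = subset_size or P
    rng = np.random.default_rng(42)          # fixed seed, illustrative only
    subsets = []
    for _ in range(n_subsets):
        rows = rng.integers(0, P, size=size) # random row indices in [0, P)
        subsets.append(DS[rows, :])          # copy the selected rows into Tr_t
    return subsets

DS = np.arange(20).reshape(10, 2)            # toy dataset with 10 rows
Tr = bagging_subsets(DS, n_subsets=3)
print([t.shape for t in Tr])
```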

B. Handover Task to Parallel Processors

The datasets partitioned in phase 1 are passed to n processors via the Message Passing Interface (MPI). MPI is a standard for parallelism based on the message-passing paradigm; it provides portability and flexibility in both the management and the implementation of parallel algorithms, with optimized code for all parallel architectures [11]. As shown in (1), processor q presents the network with its partitioned dataset Tr_q as a batch of training patterns and accumulates the resulting weight changes into a component gradient, as in [10]:

$\Delta w_q = \sum_{p \in Tr_q} \Delta w_p$  (1)

where $\Delta w_p$ is the weight change computed for training pattern p.
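As an illustration of this hand-over, the sketch below assumes mpi4py and a toy linear model; the Allreduce-based synchronization follows the batch-parallel idea of [10] and is not the paper's exact procedure. Each processor computes the component gradient of (1) over its own subset and exchanges it with the others.

```python
# Run with e.g.: mpiexec -n 4 python handover_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_features, n_outputs, eta = 4, 3, 0.05
W = np.full((n_features, n_outputs), 0.01)      # identical initial weights on every rank

# In the framework, rank 0 would scatter the bagged subsets Tr1..Trn; here every
# rank fabricates its own toy subset so the sketch stays self-contained.
rng = np.random.default_rng(rank)
X, T = rng.random((50, n_features)), rng.random((50, n_outputs))

# Component gradient of processor q, as in (1): the sum of per-pattern gradients
# of 0.5*||t - y||^2 for a toy linear model y = x W.
local_grad = X.T @ (X @ W - T)

total_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, total_grad, op=MPI.SUM)   # exchange and sum component gradients

W -= eta * total_grad / (size * len(X))              # same synchronized update on all ranks
if rank == 0:
    print("updated weights from", size, "component gradients")
```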

C. ANN Learning for an Individual ANN

The ANNs in the ensemble should differ from one another, since each member is meant to contribute diversity [16]. There are several methods that can be used to make the ensemble members produce different errors: a) different initial conditions, b) different network architectures, c) different training data, and d) different training algorithms. The first method may involve initializing each ANN with different random weights, learning rates, momentum values, or a combination of these. The second method varies the network architecture by changing the number of hidden layers or hidden nodes, so that each ANN is set up with a different architecture. The third method diversifies the training data; resampling or data pre-processing methods such as bagging, noise injection, cross-validation, stacking, boosting and input decimation can be used. The last method forces each individual ANN to run a different learning approach, such as a feed-forward ANN [2], RProp training [17], QuickProp [18], conjugate gradient [19], a recurrent network [20], a Hopfield network [21] or a Self-Organizing Map [4].
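As a simple illustration of these diversity options, an ensemble could be parameterized as in the sketch below; all names and parameter values are hypothetical and are not taken from the paper.

```python
# Illustrative only: one way to describe diverse ensemble members.
from dataclasses import dataclass
import random

@dataclass
class MemberConfig:
    seed: int          # a) different initial conditions (random weight seed)
    hidden_nodes: int  # b) different network architecture
    subset_id: int     # c) different training data (index of bagged subset Tr_i)
    algorithm: str     # d) different training algorithm

def make_ensemble_configs(n_members):
    rng = random.Random(0)
    algos = ["backprop", "rprop", "quickprop", "conjugate_gradient"]
    return [
        MemberConfig(
            seed=rng.randint(0, 10**6),
            hidden_nodes=rng.choice([16, 32, 64]),
            subset_id=i,                      # one bagged subset per member
            algorithm=algos[i % len(algos)],
        )
        for i in range(n_members)
    ]

print(make_ensemble_configs(4))
```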

In this paper, the third technique is adopted. The use of different training data easily creates diverse subsets, because the output of each BP network is not unique [2]. A hyperbolic tangent transfer function in the hidden layer of the ANN is used to approximate the mapping between the network's input and output [22]. The individual ANNs are trained with the BP algorithm on the different dataset partitions Tr1, Tr2, ..., Trn. The final output of the feed-forward pass, used as the first phase of the BP algorithm, is based on Werbos (1974) as cited in [6]:

$y_k = F_3\left(\sum_{j} w_{jk}\, F_2\left(\sum_{i} w_{ij}\, x_i\right)\right)$  (2)

where $x_i$ are the network inputs, $w_{ij}$ and $w_{jk}$ are the input-to-hidden and hidden-to-output weights, and $F_2$ and $F_3$ are the hidden-layer and output-layer activation functions.

The error function is defined as:

$E = \frac{1}{2}\sum_{k}(t_k - y_k)^2$  (3)

where $t_k$ is the target output and $y_k$ is the actual network output. In the second phase of the BP algorithm, gradient descent is performed to determine the magnitude of the weight changes. The total squared error in (3) is propagated back from the output layer to the hidden and input layers, and the weight changes for the input and hidden layers and their thresholds are determined using (4):

$\Delta w = -\eta\,\frac{\partial E}{\partial w}$  (4)

where the value $\eta$ lies between 0 and 1 and is a learning parameter that controls the convergence of the algorithm.

The two phases of the BP scheme are then repeated to minimize the error function until E converges to a minimum value; in this case, a sum squared error of 0.001 and a maximum of 10,000 iterations are used. $F_3$ in (2) produces the output y, which is used to indicate whether the ensemble classifiers are sufficiently reliable for integrating the ensemble members represented by the individual ANNs.
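To make the training loop concrete, the following sketch applies the two BP phases of (2)-(4) with the stopping criteria above. It is an illustrative toy re-implementation; the data, the layer sizes and the linear output layer are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 4))                 # toy training subset Tr_q (30 patterns, 4 inputs)
T = rng.random((30, 3))                 # toy target outputs
W1 = rng.normal(scale=0.1, size=(4, 8)) # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(8, 3)) # hidden -> output weights
eta = 0.01                              # learning rate, 0 < eta < 1

for epoch in range(10000):              # maximum number of iterations
    H = np.tanh(X @ W1)                 # hidden layer with hyperbolic tangent, cf. (2)
    Y = H @ W2                          # linear output layer (illustrative choice of F3)
    E = 0.5 * np.sum((T - Y) ** 2)      # squared error as in (3)
    if E < 0.001:                       # sum squared error threshold
        break
    # Backward phase: gradient descent as in (4)
    dY = Y - T
    dW2 = H.T @ dY
    dH = (dY @ W2.T) * (1.0 - H ** 2)   # derivative of tanh
    dW1 = X.T @ dH
    W2 -= eta * dW2
    W1 -= eta * dW1

print(f"stopped after {epoch + 1} epochs with E = {E:.4f}")
```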

D. Transforming into Reliability Values

The output value of the selected ANN classifier on each processor is rescaled to have zero mean and unit standard deviation; this process is called reliability value transformation. The rescaled values lie in the interval [-1, 1] as in (5), because the hyperbolic function helps the network converge faster than the sigmoid function when approximating the network's learning [22]. With these values, the multiple ANN results are more easily integrated into an ensemble output, using the reliability value of (5):

$R(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$  (5)

E. Integrating Multiple ANNs

In this phase the master processor computes the aggregated output. The multiple ANNs must be integrated so that the outputs of the ensemble are combined into a single aggregated output. A simple way to combine the network outputs is the ensemble average of (6):

$y_{ensemble}(x) = \frac{1}{N_{net}} \sum_{i=1}^{N_{net}} y_{net,i}(x)$  (6)

Other popular techniques, such as ranking and weighted averaging, are also used in reported studies [14]. These techniques let the ensemble members reach a majority decision, where they agree on a certain result regardless of the diversity and accuracy of the individual ANNs. In this case, however, the multiple ANNs are correlated with each other, so (7) is used; the expected error decreases as the number of ensemble members increases after integrating the outputs [23]:

$E_{ensemble} = \bar{E} \times \frac{1 + \rho\,(N_{net} - 1)}{N_{net}}$  (7)

where $\bar{E}$ is the average error of the individual networks and $\rho$ is the average correlation between their errors.
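A minimal sketch of phases D and E is given below, with invented member outputs: each member's raw output is rescaled with the hyperbolic function of (5) and the rescaled outputs are combined with the simple ensemble average of (6).

```python
import numpy as np

def reliability(y):
    """Rescale a member's raw outputs into [-1, 1] with the hyperbolic function of (5)."""
    return (np.exp(y) - np.exp(-y)) / (np.exp(y) + np.exp(-y))   # i.e. tanh(y)

def ensemble_average(member_outputs):
    """Combine the rescaled member outputs with the simple average of (6)."""
    R = np.stack([reliability(y) for y in member_outputs])       # shape: (N_net, n_outputs)
    return R.mean(axis=0)

# Toy example: three members scoring the same test pattern over 3 classes.
outputs = [np.array([2.1, -0.3, 0.4]),
           np.array([1.7, 0.1, -0.2]),
           np.array([0.9, -1.0, 0.6])]
combined = ensemble_average(outputs)
print("aggregated output:", combined, "-> class", int(np.argmax(combined)))
```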

IV. RESULT AND DISCUSSION

In this section, the published UCI Iris plant dataset and the MNIST handwritten character benchmark are used to test the proposed approach. The datasets are chosen to represent different ANN sizes. The framework for multiple ANN systems is tested using Message Passing Interface (MPI) tools; MPI is used for synchronous communication between processors. All ANN classifiers are trained with different dataset partitions and parameters: for example, two ANNs may share the same hidden-node setup but use different learning rates and momentum values, while two other classifiers use different hidden-node setups and learning rate/momentum together with resampling techniques for the input. The accuracy of the recognition rate is measured by the total number of correctly recognized patterns. The increase of performance and the error reduction are calculated for the single ANN and the parallel ANNs as in (8), and the percentage of error reduction (PER) in (9) is obtained from the percentages of correctly classified patterns in the test data of the single network and of the multiple ANNs:

$IP = Performance_{ensemble} - Performance_{single\ net}$  (8)

$PER = 100 \times \frac{Error_{single\ net} - Error_{ensemble}}{Error_{single\ net}}$  (9)
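As a worked illustration of how (8) and (9) are applied, the two measures can be computed as below; the accuracy values are invented for the example and are not taken from the reported experiments.

```python
def increase_of_performance(acc_ensemble, acc_single):
    """IP as in (8): difference between ensemble and single-network accuracy."""
    return acc_ensemble - acc_single

def percentage_error_reduction(acc_ensemble, acc_single):
    """PER as in (9): relative reduction of the error rate, in percent."""
    err_single, err_ensemble = 1.0 - acc_single, 1.0 - acc_ensemble
    return 100.0 * (err_single - err_ensemble) / err_single

# Hypothetical accuracies, used only to show the arithmetic.
print(increase_of_performance(0.96, 0.92))      # IP  = 0.04
print(percentage_error_reduction(0.96, 0.92))   # PER = 50.0
```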

The result is analyzed to assess the performance of the ensemble strategy. Fig. 3 shows the performance improvement provided by the multiple ANNs, obtained using (8); there is a considerable increase of performance over the single network for both the Iris and MNIST datasets. Fig. 4 shows the PER for various ensemble sizes. The PERs fall in the range of 24% to 58% for the Iris dataset and around 12% to 42% for the MNIST dataset. The performance of the parallel ANNs shows a slight increase as the number of ensemble members grows. The simple-average output also shows that a pattern correctly classified by a single network may not be correctly classified by the multiple ANNs when the ensemble strategy is used. The positive IP and PER values obtained for all the larger processor counts show that the ensemble network performs better than the single network.

Figure 3. Increase of performance (IP) versus the number of processors for the MNIST and Iris datasets.

Figure 4. Percentage of error reduction (PER) versus the number of processors for the MNIST and Iris datasets.

The overall performance of the parallelism approach is measured using the speedup factor shown in Fig. 5 and Fig. 6 for the MNIST and Iris datasets respectively. The speedup factor for the Iris dataset is less than 10% relative to the single computer network, whereas the speedup factor for the MNIST dataset reaches nearly 20%. The increase of the speedup factor for the MNIST dataset is considerably higher when the ensemble strategy is used. These results show that the large data size in the test cases is a significant factor in the parallelization of an ANN, and that a large distributed system helps in assessing the system's performance in a diverse environment. The proposed integrated multiprocessor ANN system using the ensemble strategy has great potential, and future research will aim to confirm whether the technique can process large datasets without compromising the accuracy of the ANNs.

Figure 5. Parallel speedup factor (%) versus the number of processors for the MNIST dataset.

Figure 6. Speedup factor versus the number of processors for the Iris dataset.

V. CONCLUSIONS

The ensemble strategy for multiple ANNs is a crucial task, as it is able to provide high performance with minimal execution time. In summary, by using data decomposition, a parallel ANN performs faster than a single network processor on large data. Moreover, the ensemble strategy for a multiprocessor ANN leads to better performance and greater error reduction than single-ANN classification. The proposed framework and the reported results can hopefully provide a promising solution to other binary-class classification and recognition problems.



ACKNOWLEDGMENT

We thank JASKOM of the Faculty of Science & Technology, University Malaysia Terengganu, fellow postgraduates, and all publication support personnel who provided helpful comments.

REFERENCES

[1] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991.
[2] L. Yu, S. Wang and K. K. Lai, "Credit risk assessment with a multistage neural network ensemble learning approach," Expert Systems with Applications, vol. 34, pp. 1434-1444, 2008.
[3] M. W. Shields and M. C. Casey, "A theoretical framework for multiple neural network systems," Neurocomputing, vol. 71, pp. 1462-1476, 2008.
[4] A. M. Kalteh, P. Hjorth and R. Berndtsson, "Review of the self-organizing map (SOM) approach in water resources: Analysis, modelling and application," Environmental Modelling & Software, vol. 23, pp. 835-845, 2008.
[5] N. M. Nawi, R. S. Ransing, M. N. M. Salleh, R. Ghazali and N. A. Hamid, "An improved back propagation neural network algorithm on classification problems," in Database Theory and Application, Bio-Science and Bio-Technology, vol. 118, Y. Zhang, et al., Eds. Berlin Heidelberg: Springer, 2010, pp. 177-188.
[6] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems. Essex, England: Pearson Education, 2002.
[7] D. C. Ciresan, U. Meier, L. M. Gambardella and J. Schmidhuber, "Deep big simple neural nets excel on handwritten digit recognition," Neural Computation, vol. 22, pp. 3207-3220, 2010.
[8] A. R. M. Kattan, R. Abdullah and R. A. Salam, "Training feed-forward neural networks using a parallel genetic algorithm with the best must survive strategy," in International Conference on Intelligent Systems, Modelling and Simulation, Liverpool, 2010, pp. 96-99.
[9] K. Ganeshamoorty and D. N. Ranasinghe, "On the performance of parallel neural network implementations on distributed memory architectures," in Eighth IEEE International Symposium on Cluster Computing and the Grid, 2008, pp. 90-97.
[10] S. Babii, "Performance evaluation for training a distributed backpropagation implementation," in SACI '07, 4th International Symposium on Applied Computational Intelligence and Informatics, 2007, pp. 273-278.
[11] V. Turchenko and L. Grandinetti, "Efficiency analysis of parallel batch pattern NN training on general purpose supercomputer," in Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing and Ambient, Salamanca, Spain, 2009, pp. 223-226.
[12] Y. Feng and Y. Fan, "Character recognition using parallel BP neural network," in International Conference on Audio, Language and Image Processing, 2008, pp. 1595-1599.
[13] C. Enachesu and C.-D. Miron, "Handwritten digits recognition using neural computing," in International Conference Interdisciplinarity in Engineering, Romania, 2010, pp. 95-104.
[14] K. Zhizhou, W. Lai, Y. Shenggang, G. Dong and C. Zixing, "A novel clustering-ensemble approach," in 2nd International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2008), 2008, pp. 722-724.
[15] Y. Yang, Y. Xu and Q.-x. Zhu, "A neural network ensemble method with new definition of diversity based on output error curve," in 2010 International Conference on Intelligent Computing and Integrated Systems (ICISS), 2010, pp. 587-590.
[16] L. Yue, T. Zaixia and Z. Bofeng, "Computational grid based neural network ensemble learning platform and its application," in International Conference on Management of e-Commerce and e-Government (ICMECG '09), 2009, pp. 176-181.
[17] J. L. Roux and E. McDermott, "Optimization methods for discriminative training," in InterSpeech, Lisbon, Portugal, 2005, pp. 3341-3344.
[18] H. Yu and B. M. Wilamowski, "Efficient and reliable training of neural networks," in 2nd Conference on Human System Interactions, Catania, Italy, 2009, pp. 106-112.
[19] A. E. Kostopoulos and T. N. Grapsa, "Self-scaled conjugate gradient training algorithms," Neurocomputing, vol. 72, pp. 3000-3019, 2009.
[20] B. Schrauwen, D. Verstraeten and J. V. Campenhout, "An overview of reservoir computing: Theory, applications and implementations," in European Symposium on Artificial Neural Networks, Bruges, Belgium, 2007.
[21] U.-P. Wen, K.-M. Lan and H.-S. Shih, "A review of Hopfield neural networks for solving mathematical programming problems," European Journal of Operational Research, vol. 198, pp. 675-687, 2009.
[22] B. Karlik and A. V. Olgac, "Performance analysis of various activation functions in generalized MLP architectures of neural networks," International Journal of Artificial Intelligence and Expert Systems (IJAE), vol. 1, pp. 111-124, 2011.
[23] A. Strehl and J. Ghosh, "Cluster ensembles – a knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.
