output partitioning of neural networks final output partitioning...6 where wjk is the connection...

1

Output Partitioning of Neural Networks

Sheng-Uei Guan∗, Qi Yinan, Syn Kiat Tan and Shanchun Li

Department of Electrical and Computer Engineering, National University of Singapore

10 Kent Ridge Crescent, Singapore 119260

Abstract Many constructive learning algorithms have been proposed to find an

appropriate network structure for a classification problem automatically. Constructive

learning algorithms have drawbacks especially when used for complex tasks and modular

approaches have been devised to solve these drawbacks. At the same time, parallel

training for neural networks with fixed configurations has also been proposed to

accelerate the training process. A new approach that combines advantages of constructive

learning and parallelism, output partitioning, is presented in this paper. Classification

error is used to guide the proposed incremental-partitioning algorithm, which divides the

original dataset into several smaller sub-datasets with distinct classes. Each sub-dataset is

then handled in parallel, by a smaller constructively trained sub-network which uses the

whole input vector and produces a portion of the final output vector where each class is

represented by one unit. Three classification datasets are used to test the validity of this

method, and results show that this method reduces the classification test error.

Index Terms ⎯ constructive learning algorithm, neural networks, output partitioning

∗ Corresponding author, e-mail address: [email protected]. The authors are with the Department of Electrical and Computer Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260.

2

1. Introduction

It is widely known that network size is of crucial importance for neural networks. Too

small a network cannot learn the problem well [1], while a size too large will lead to

overfitting and thus poor generalization [2]. It is a key issue in neural network design to

find an appropriate network size automatically for a given application and optimize the

set of network weights.

There are three approaches to tackle this issue: pruning, regularization, and

constructive algorithms. In pruning [3], some hidden units or weights are removed during

training if they are no longer actively used. Regularization uses some penalty terms in the

cost function to force the weights to yield smooth approximations [4]. The third

approach, called growing or constructive approach, starts with a small network and then

grows additional hidden units and weights until a satisfactory solution is found. The

constructive approach has a number of advantages over pruning and regularization

approaches. Detailed descriptions can be found in [5].

Constructive methods include the Dynamic Node Creation (DNC) method [6],

CasCor family of learning algorithms [7] (including standard Cascade-Correlation (CC)

algorithm [8]), Constructive single-hidden-layer network [9] and Constructive

Backpropagation (CBP) algorithm [10], etc. Among them, DNC and CBP have only one

single hidden layer and a new hidden unit receives complete connections from the inputs

and is connected to all output units. DNC has no shortcut connections between the input

units and the output units while CBP begins with shortcut connections. CC begins with

shortcut connections and then automatically trains and adds new hidden units one by one

3

to create a multilayered network. Each hidden layer so constructed consists of just one

single unit, which receives connections from each of the network’s inputs as well as from

all hidden units in previous layers.

Efforts [7, 9-13] have been made to compare their generalization capability. The

results obtained show that: 1) In most cases, whether to cascade hidden units or not does

not make a significant difference at all; 2) For some datasets, especially for regression

problems, non-cascading hidden units is even superior to cascading them.

Besides constructive methods above, Bayesian approaches [27] can be appropriate

alternatives to automatically determine the optimal network size. In [28, 29], Mackay

suggested a hierarchical inference approach with the following three progressive levels:

weight inference, hyperparameter inference and model comparison. Using Bayesian

approaches, model comparisons do not require any validation sets, hence more samples

are available for training. However, Bayesian approaches are in general computationally

expensive.

Although constructive learning algorithms can automatically find an optimal

combination, they are still suffering from drawbacks such as inefficiency in utilizing

network resources as the task (and the network) gets larger, and inability of the current

learning schemes to cope with high-complexity tasks [14]. Large networks tend to

introduce high internal interference because of the strong coupling among their input-to-

hidden layer weights [15]. Modular neural networks attempts to solve these issues via a

“divide and conquer” approach. Using this approach, [16] divides the training set into

subsets recursively using hyperplanes until all the subsets become linearly separable. [17]

4

constructs neural networks where the first unit introduced on each hidden layer is trained

on all examples and further units on the layer are trained primarily on examples not

already correctly classified. Using modular neural networks, the input data are partitioned

into several subspaces, and simple systems are trained to fit the local data. Such data

partitioning is often more effective than training on the whole input data space [18].

At the same time, parallel training has also been proposed in order to gain faster

training. For multilayered neural networks, backpropagation algorithms, including BP

[19], RPROP [20], Quickprop [21], SuperSAB [22], etc., reveal four different types of

parallelism as follows [23]: training session parallelism, training set parallelism,

pipelining and node parallelism. These four parallelisms are commonly used for parallel

training of a neural network with fixed configuration.

Different from the previously proposed modular neural networks and the four

network parallelisms mentioned above we propose a new approach — output partitioning.

A dataset to be classified can be partitioned into several smaller sub-datasets with distinct

classes. Each sub-dataset is then handled by a smaller sub-network which uses the whole

input vector as input and produces a portion of the final output vector (each class is

represented by a unit). Each sub-network solution to each sub-dataset is grown and

trained using constructive algorithms; this can be performed simultaneously on parallel

processing elements. The grown sub-networks are then integrated to produce the final

results. This method creates smaller neural networks which have reduced internal

interference among hidden layers, consequently, reduces computational time and

improves performance. In section 1, we briefly recall the CBP learning algorithm. The

concept of output partitioning is described in section 2. The proposed incremental-

5

partitioning algorithm is then described in section 3. Experiments about output

partitioning are implemented and the results analyzed in section 4. Finally, the

conclusions are presented in section 5.

1.1 Constructive backpropagation learning algorithm

The CBP learning algorithm (depicted in Figure 1) can be described briefly as follows

[10]:

1. Initialization: The network has no hidden units initially. Only bias weights and

shortcut connections from input units to output units feed the output units. The weights of

this initial configuration are trained by minimizing the sum of squared errors cost

function:

∑∑= =

−=P

p

K

kpkpk toSSE

1 1

2)( (1)

where P is the number of training patterns, K is the number of output units, pko is the

actual output value at the k th output node for the p th training pattern and pkt is the

desired output value at the k th output node for the p th training pattern. After training,

the weights of this initial configuration are fixed permanently.

2. Training a new hidden unit: Connect inputs to the new unit (let the new unit be the i th

hidden unit, 0>i ) and connect its output to the output units. Adjust all the weights

connected to the new unit (both input/output connections) by minimizing modified SSE

(mSSE):

2

1 1

1

0

)(∑∑ ∑= =

−

=⎟⎟⎠

⎞⎜⎜⎝

⎛−+=

P

p

K

kpkpiik

i

jpjjki towowamSSE (2)

6

where jkw is the connection from the j th hidden neuron to the k th output unit

( kw0 represents a set of weights which are the bias weights and shortcut connections

trained in step 1), pjo is the output of the j th hidden neuron for the p th training pattern

( 0po represents inputs to bias weights and shortcut connections), while )(⋅a is the

activation function. Note that from the new i th neuron's perspective, the previous

neurons are fixed. We are only training the weights connected to the new unit.

3. Freezing the new hidden unit: Fix the weights connected to the units permanently.

4. Testing for convergence: If the current number of hidden units yields an acceptable

solution, then stop training. Otherwise go back to step 2.

Figure 1. Evolution of the CBP architecture

2. Output partitioning

The goal for constructively constructing a neural network with output partitioning is to

obtain an appropriate architecture of sub-networks with a set of weights that

satisfy thEE < ( thE is the threshold of E ).

7

2.1 Dataset partitioning

If we partition the output vector (each class is represented by one unit) into r sections

( rSSS ,,, 21 ⋅⋅⋅ ), each containing at least one output unit, then equation (1) can be

transformed into:

∑∑= =

−=P

p

K

kpkpk toE

1 1

2)(

∑ ∑ ∑= =

+

+=⎢⎣

⎡−+−=

P

p

S

k

SS

Skpkpkpkpk toto

1 1 1

221

1

21

12

2211)()( ⎥

⎦

⎤−+⋅⋅⋅+ ∑

++⋅⋅⋅++= −

K

SSSkpkpk

rr

rrto

1

2

121

)(

∑ ∑ ∑∑∑∑= = ++⋅⋅⋅++=

+

+== = −

−+⋅⋅⋅+−+−=P

p

P

p

K

SSSkpkpk

SS

Skpkpk

P

p

S

kpkpk

rr

rrtototo

1 1 1

2

1

2

1 1

2

121

21

12

22

1

1

11)()()(

rEEE +⋅⋅⋅++= 21 (3)

where KSSS r =+⋅⋅⋅++ 21 .

1E , 2E ,…, rE are independent of each other. The only constraint among them is

that their sum E should be smaller than thE . So we can partition the original dataset into

r sub-datasets.

2.2 Sub-network growing

After partitioning, the original dataset is divided into r sub-datasets. The original single

neural network solution for the dataset is replaced by r sub-networks (sub-NN), i.e. sub-

NN1, sub-NN2, …, sub-NNr, each of which is constructed for a sub-dataset, as shown in

Figure 2. Each sub-NN can be grown and trained on different processing elements. When

8

we are classifying an unknown sample, each sub-NN computes a portion of the output

vector and their results are merged to generate the final output.

Figure 2. Parallel growing based on output partitioning

3. Algorithm for growing and training neural networks using

output partitioning

3.1 Definitions

Constructive learning algorithms are very sensitive to changes in the stopping criteria. If

training is too short, the components of the network will not work together well enough

for good results. If training is too long, it costs too much computation time and may result

in overfitting and bad generalization. In this paper, we adopt the method of early stopping

[7, 24] using a validation set.

The set of available patterns is divided into three sets: a training set is used to

train the network, a validation set is used to evaluate the quality of the network during

training and to measure overfitting, and finally a test set is used at the end of training to

1 2 K-1 K …

sub-NN1 sub-NNr-1 sub-NNr

… 1−rS 1S rS

9

evaluate the resultant network. In this paper, the size of the training, validation and test

set is 50%, 25% and 25% of the dataset’s total available patterns.

The error measure E used is the squared error percentage [24], derived from the

normalization of the mean squared error to reduce the dependence on the number of

coefficients in (1) and on the range of output values used:

∑∑= =

−⋅−

⋅=P

p

K

kpkpk to

PKoo

E1 1

2minmax )(100 (4)

where maxo and mino are the maximum and minimum output values in (1).

)(tEtr is the average error per pattern of the network over the training set,

measured after epoch t . The value )(tEva is the corresponding error on the validation set

after epoch t and is used by the stopping criteria. )(tEte is the corresponding error on the

test set; it is not known to the training algorithm but characterizes the quality of the

network resulting from training. The value )(tEopt is defined to be the lowest validation

set error obtained in epochs up to epoch t :

)'(min)('

tEtE vattopt ≤= (5)

The relative increase of the validation error over the minimum so far (in percent)

is defined as the generalization loss at epoch t :

)1)()((100)( −⋅=tEtEtGL

opt

va (6)

A high generalization loss is one candidate reason to stop training because it

directly indicates overfitting.

10

A training strip of length m [24] is defined to be a sequence of m epochs

numbered n+1, …, n+m, where n is divisible by m. The training progress measured

after a training strip is:

)1)'(min

)'((1000)(

...1'

...1' −⋅

⋅=+−∈

+−∈∑tEm

tEtP

trtmtt

tmtt trm (7)

that is used to measure how much larger the average training error is than the minimum

training error during the training strip.

3.2 Procedure for growing and training of the sub-networks

The procedure for growing and training each sub-NN is shown in Figure 3. sub_epoch is

used to represent running epochs for training one configuration of a sub-NN. total_epoch

is used to represent the total running epochs for growing the sub-NN (the sum of all

sub_epochs). Initially, the sub-NN has no hidden units. There are only bias weights and

the shortcut connections between inputs and output units. Now sub_epoch and

total_epoch are both set to 1. Then train the initial neural network using the RPROP

algorithm. Set the corresponding validation error va optE E= , the optimal validation error

so far. Record the weights accordingly as the optimal weights.

After every m epochs (m is strip length, we used m =5), compare vaE to optE . If

vaE is less than optE , then set optE as vaE and record the weights accordingly as the

optimal weights. After every epoch, check the overall stopping criteria :

11

(Training has reached Z = 5000 epochs) OR ( thopt EE < )

OR ((Reduction of training set error due to the last new hidden unit is less than 1% )

AND (Validation set error increased due to the last new hidden unit))

If the overall stopping criteria are not satisfied, then check the criteria for adding a

new hidden unit. If a new hidden unit should be added, then copy weights from the

optimal weights and freeze the weights. The criteria for adding a new hidden unit:

(At least X epochs reached)

AND ((Generalization loss )(tGL >5) OR (Training progress )(tPm <0.1) OR (Y

epochs reached))

Figure 3. The growing and training procedure for sub-NNs

Construct and Initialize sub-NN sub_epoch = 1 total_epoch = 1 While (total_epoch < Z) { Train the current configuration of sub-NN for one epoch If (sub_epoch == 1)

optE = vaE (Record the weights accordingly as the optimal weights)

If (sub_epoch % m == 0 && vaE < optE )

optE = vaE (Record the weights accordingly as the optimal weights)

If ( thopt EE < || little-improvement from last new hidden unit) Break

If (sub_epoch > X && ( )(tGL >5 || )(tPk <0.1 || sub_epoch > Y)) Copy weights from the optimal weights

Add a new hidden unit and initialize the weights (randomly) sub_epoch = 0 sub_epoch ++ total_epoch++ } Calculate teE and Exit

#We used X = 80 epochs while Y = 500 epochs.

12

3.3 Incremental-Partitioning algorithm

The motivation to use several smaller sub-NNs is to reduce the high internal interference

inherent in large networks [15] which has to classify all K classes in a particular dataset.

By dividing the original dataset into several sub-datasets with distinct classes and using

smaller separate, sub-NNs for each sub-dataset of classes, we can reduce the internal

interferences in the input-to-hidden layer weights, hence reduce the overall classification

errors. Classification error is used as a metric to measure the interference among different

classes. Here we propose an incremental-partitioning algorithm to produce near-optimal

partitioning of classes according to the classification error. The algorithm produces a near

optimal partitioning of the classes, in the form of an unordered list – for example

{{6,3,2,11,9},{10,1,8},{4,7},{5}} where each element in the list represents a specific

class in the original dataset (11 classes). In the above example, there are four partitions

each representing a sub-dataset that will be trained on a sub-NN separately.

The details of this algorithm are described as follows.

Step 1: Find the classification error iC of each class and order them in ascending order as

{ ,..., ,..., }a b cC C C , where a b cC C C≤ ≤ . To obtain the individual iC , 1 i K≤ ≤ , all

patterns not belong to class i are labeled as patterns of class i . A single NN is then used

for the resulting two-class classification problem.

Step 2: Form the first partition with the class i , i.e. {{ }}i .

13

Step 3: Choose another class j , add it to the above partition, i.e. {{ , }}i j and measure the

classification error. Similar to Step 1, all patterns not belonging to class i and class j are

assigned as patterns of class ,i j . A single NN is then used for the resulting three-class

classification problem. Next, put class j into a new partition, i.e. {{ },{ }}i j and measure

again the classification error (training a sub-NN for each partition). For each sub-NN

training respectively, all patterns not belonging to class i /class j are assigned as patterns

of class i /class j . A single NN is then used for each resulting two-class classification

problem. Adopt the partitioning that produces a smaller classification error.

Step 4: Repeat Step 3 for all remaining classes in the dataset. Assign another class to each

existing partition and measure the resulting classification error. Next, form a new

partition containing only this class and again measure the resulting classification error.

For each sub-NN to be trained, all patterns not belonging to the classes in the partition

will be assigned as patterns of a single “dummy” class. The classification error thus

obtained is a measure of the difficulty in discriminating between patterns of a particular

class (or some classes) and patterns from all the other classes. Based on the smallest

possible classification error over all the possible combinations tested, this class is then

either added to any existing partition or used to form a new single-class partition. In the

final list, each partition contains distinct classes to be classified using a separate sub-NN.

There are two major considerations required in the above algorithm - how to

choose the first class to form the first partition and how to decide the order of assigning

the remaining classes.

14

For the first class, we chose the class with the largest classification error (last

element in { ,..., ,..., }a b cC C C ) since this class will dominate both the NN performance

and the NN structure. When the classification error for a particular class is high, it will be

very difficult to distinguish between the patterns belonging to this class and those

patterns from the other classes (see Step 1 on how iC is obtained). Therefore, the

classification error for patterns from this class will dominate the overall classification

error. The size of the classification error also influences directly the overall structure of a

NN. During NN training, the NN either generates more hidden units or performs more

weight adjustment to reduce any large classification error.

We assign the remaining classes using the ordered list { ,..., ,...} \a b cC C C (the

class with largest classification error has previously been assigned), that is to try and

partition the classes with the smallest classification error first. Patterns belonging to

classes with smaller errors tend to be more easily distinguished from patterns belonging

to the other classes. And it is more likely to obtain the lowest possible classification error

when such classes with smaller errors are assigned into existing partitions. Furthermore,

lesser partitions will be present in the final list generated. With lesser sub-NNs to be

trained, fewer resources are required.

Soft constraints were also used in the incremental-partitioning process where a

small predefined performance margin is used for considering alternative solutions.

During each stage of assigning classes into partitions, there can exist several intermediate

solutions whose classification errors are close. Whenever a big increase in the

classification error for all possible solutions is encountered in any stage of the algorithm,

15

back-tracking is performed and we consider alternative intermediate solutions (with

classification errors within the margin) obtained in the previous stage.

4. Experimental results

4.1 Experiment scheme

In this paper, we adopt the CBP network growing algorithm with the RPROP algorithm

[20] for training the sub-NNs. The RPROP algorithm was used with the following

parameters: 2.1=+η , 5.0=−η , 1.00 =∆ , 50max =∆ , 60.1min −=∆ e , initial weights

from –0.25 … 0.25 randomly. The hidden units and output units all use the sigmoid

activation function. All the experiments were simulated on a single Pentium III –500 PC.

If implemented in parallel on multiple processors, communication overhead may be

negligibly small as there is not much communication until the result merging stage.

4.2 Benchmark dataset descriptions

We examined the above automatic partitioning procedure on three classification datasets

(see Table 1 for description). For each classification dataset, we use the incremental-

partitioning algorithm with performance margin 1% (i.e. alternative solutions have

classification errors within 1% of the lowest classification error), as given in Section 3.2.

Table 1. Benchmark dataset descriptions

Dataset No. of outputs (classes)

No. of inputs No. of examples in (train / validation / test) sets

Thyroid 3 20 3600 / 1800 / 1800 Glass 6 9 107 / 54 / 53 Vowel 11 10 495 / 248 / 247

16

4.2.1 Thyroid1

The individual classification error of each class is shown in Table 2. According to the

algorithm in Section 3.2, we form the first partition with class 2. The incremental-

partitioning process is shown in Table 3.

Table 2. Classification errors for individual classes in Thyroid1

Class 1 2 3 Ci 1.58 1.83 1.72

Table 3. Incremental-partitioning of Thyroid1

Class Assigned Partitioning / Ci Partitioning / Ci Partitioning / Ci 2 {2} / 1.83 1 {2,1} / 2.22 {{2},{1}} / 1.89 3 {{2,3},{1}} / 2.06 {{2},{1,3}} / 1.72 {{2},{1},{3}} / 1.89

The solutions formed during each stage, together with their associated

classification error iC , are shown in Table 3. The class being assigned is shown in the

leftmost column and the classification errors of partitionings attempted by the algorithm

are recorded in the subsequent columns with the best current partitioning highlighted in

bold. As the algorithm progresses, more partitionings will be attempted if more partitions

were formed in the earlier stages. The partitionings can be read as, for example {2,1}

means class 2 and 1 are assigned into one same partition and {{2},{1}} means class 2

and 1 are assigned into two different partitions. For the Thyroid1 problem, we get the

near-optimal partition of {{2},{1, 3}}.Compared with the non-partitioning and full-

partitioning methods (shown in Table 4), the classification error was reduced by 8.9%

and 7.5% respectively.

Table 4. Comparison of different methods for Thyroid1

Non-partitioning Full-partitioning Incremental-partitioning Classification error 1.86 1.89 1.72

17

4.2.2 Glass1

The individual classification error of each class is shown in Table 5. We form the first

partition with class 2. The incremental-partitioning process is shown in Table 6.

Table 5. Classification errors for individual classes in Glass1

Class 1 2 3 4 5 6 Ci 20.28 34.91 8.30 1.04 0.85 9.43

Table 6. Incremental-partitioning of Glass1

Class Assigned

Partitioning / Ci Partitioning / Ci Partitioning / Ci Partitioning / Ci

2 {2} / 34.91

5 {2,5} / 34.34

{{2},{5}} / 34.15

{2,5,4} / 37.74

{{2,5},{4}} / 36.91

{{2,4},{5}} / 35.00

{{2},{4,5}} / 35.85

4

{{2},{4},{5}} / 34.82

{{2,3},{4},{5}} / 30.76

{{2,}{3,4},{5}} / 34.49

{{2},{4},{3,5}} / 35.57

{{2},{3},{4},{5}} / 29.06

3

{{2,3,4},{5}} / 32.08e

{{2,4},{3,5}} / 41.51

{{2,4},{3},{5}} / 43.39

{{2,6},{3},{4},{5}} / 36.42

{{2},{3,6},{4},{5}} / 37.08

{{2},{3},{4,6},{5}} / 37.08

{{2},{3},{4},{5,6}} / 39.05

6

{{2},{3},{4},{5},{6}} / 36.13

{{2,6,1},{3},{4},{5}} / 32.93

{{2,6},{3,1},{4},{5}} / 36.96

{{2,6},{3},{4,1},{5}} / 37.73

{{2,6},{3},{4},{5,1}} / 38.85

{{2,1},{3},{4},{5},{6}} / 36.32

{{2},{3,1},{4},{5},{6}} / 36.96

{{2},{3},{4,1},{5},{6}} / 39.84

{{2},{3},{4},{5,1},{6}} / 37.71

1

{{2,},{3},{4},{5},{1,6}} / 39.88

As shown in Table 6, in some stages, partitionings whose classification errors are within

the specified performance margin (1%), are also kept for future generation (more than

one entry highlighted in bold). Compared with the non-partitioning and full-partitioning

methods (shown in Table 7), the classification error was reduced by 8.9% and 20%

respectively.

18

Table 7. Comparison of different methods for Glass1


4.2.3 Vowel1

The individual classification error of each class is shown in Table 8. We form the first

partition with class 6. The incremental-partitioning process is shown in Table 9.

Table 8. Classification errors for individual classes in Vowel1

Class 1 2 3 4 5 6 7 8 9 10 11 C.error 2.00 2.39 2.00 2.81 5.43 8.40 4.72 3.95 7.45 2.19 6.84

Table 9. Incremental-partitioning of Vowel1

Class Assigned

Partitioning / Ci Partitioning / Ci Partitioning / Ci

6 {6} / 8.40 3 {6,3} / 8.40 {{6},{3}} / 11.84 1 {{6,3},{1}} / 13.26 {6,3,1} / 15.69 10 {{6,3,10},{1}} / 15.49 {{6,3},{10},{1}} / 13.82 {{6, 3},{10},{1}} / 12.85 2 {{6,3,2},{10,1}} / 13.82 {{6,3},{10,1,2}} / 14.63 {{6,3},{10,1},{2}} / 14.37 4 {{6,3,2,4},{10,1}} / 14.47 {{6,3,2},{10,1,4}} / 13.77 {{6,3,2},{10,1},{4}} / 13.21 8 {{6,3,2,8},{10,1},{4} / 19 {{6,3,2},{10,1,8},{4}} / 13.92 {{6,3,2},{10,1},{4,8}} / 17.71 {{6,3,2},{10,1},{4},{8}}

/ 17.41

7 {{6,3,2,7},{10,1,8},{4}} / 22.98

{{6,3,2},{10,1,8,7},{4}} / 20.39

{{6,3,2},{10,1,8},{4,7}} / 19.08

{{6,3,2},{10,1,8},{4},{7}} / 20.60

5 {{6,3,2,5},{10,1,8},{4,7}} / 18.72

{{6,3,2},{10,1,8,5},{4,7}} / 16.85

{{6,3,2},{10,1,8},{4,7,5}} / 19.64

{{6,3,2},{10,1,8},{4,7},{5}} / 16.09

11 {{6,3,2,11},{10,1,8},{4,7},{5}} / 21.15

{{6,3,2},{10,1,8,11},{4,7},{5}} / 25.40

{{6,3,2},{10,1,8},{4,7,11},{5}} / 27.53

{{6,3,2},{10,1,8},{4,7},{5,11} / 21.41

{{6,3,2},{10,1,8},{4,7},{5},{11}} / 22.67

9 {{6,3,2,11,9},{10,1,8},{4,7},{5}} / 18.57

{{6,3,2,11},{10,1,8,9},{4,7},{5}} / 28.34

{{6,3,2,11},{10,1,8},{4,7,9},{5}} / 20.11

{{6,3,2,11},{10,1,8},{4,7},{5,9}} / 19.43

{{6,3,2,11},{10,1,8},{4,7},{5},{9}} / 21.15

19

We obtain the near-optimal partitioning {{6, 3, 2, 11, 9}, {10, 1 8}, {4, 7}, {5}}. The

comparison with the non-partitioning and full-partitioning methods is shown in Table 10.

We can find the classification error was reduced significantly by partitioning. Recalling

from Table 9, an interesting result is that we can find many solutions whose error is

smaller than the non-partitioning or full-partitioning methods. For this dataset, there exist

several incremental-partitioning solutions.

Table 10. Comparison of different methods for Vowel1


4.3 Analysis of complexity

In the proposed algorithm, the computation complexity is not constant but changes with

the specified performance margin. When soft constraints are not applied, the required

number of partitions to test is 1

K

ii

=∑ , where K is the total number of classes. If the

specified performance margin is large, the number of competing partitioning candidates

per stage will increase. In the extreme situation where the margin is infinite, the searching

algorithm will search every possible partitioning to look for the global optima. Therefore,

we can find a partitioning with the best performance. If the margin is small, for example

1% in above cases, the number of competing partitions is decreased. So the complexity of

computation is also decreased. However, the final partitioning is less probable to be the

global optima. In conclusion, there is a tradeoff between the complexity of computation

and the performance of the incremental-partitioning algorithm.

5. Conclusions

An approach to grow and train neural networks based on output partitioning is presented.

A dataset can be partitioned into several simpler sub-datasets where internal interference

20

is greatly reduced. From the experimental results obtained for all three datasets,

partitioning a dataset using our proposed incremental-partitioning algorithm leads to

lower classification errors. The results were better than either the non-partitioning and

full-partitioning methods. The improvement is especially significant in datasets with

more classes (Vowel). In datasets with large number of classes, the chances of having

imbalanced class data are higher. The actual distance between each class is also likely to

be non-uniform. Output partitioning utilizes the above phenomenon to assign appropriate

classes to separate sub-NNs with the incremental-partitioning algorithm. And through

adjusting the performance margin, we have tradeoff between the computation complexity

and the performance of the final partitioning.

References

[1] A. Blum and R. L. Rivest, Training a 3-node neural network is NP-complete, Neural

Networks, 5(1), 1992, pp. 117-128.

[2] E. B. Baum and D. Haussler, What size net gives valid generalization? Neural

Computation, 1(1), 1989, pp. 151-160.

[3] R. Reed, Pruning algorithm—a survey, IEEE Transactions on Neural Networks, 4(5), 1993,

pp. 740-747.

[4] T. Poggio, and F. Girosi, Regularization algorithms for leaning that are equivalent to multi-

layer networks, Science, 247, 1990, pp. 978-982.

[5] T. Y. Kwok and D. Y. Yeung, Objective functions for training new hidden units in

constructive neural networks, IEEE Transactions on Neural Networks, 8(5), 1997, pp.

1131-1148.

[6] T. Ash, Dynamic node creation in backpropagation networks, Connection Science, 1(4),

1989, pp. 365-375.

[7] L. Prechelt, Investigation of the CasCor family of learning algorithms, Neural Networks,

10(5), 1997, pp. 885-896.

21

[8] S. E. Fahlman and C. Lebiere, The cascade-correlation learning architecture, In D. S.

Touretzky, Advances in neural information processing systems 2, Morgan Kaufmann

Publishers, CA, 1990, pp. 524-532.

[9] D. Y. Yeung, A neural network approach to constructive induction, Proceedings of the

Eighth International Workshop on Machine Learning, Evanston, Illinois, 1991, U.S.A.

[10] M. Lehtokangas, Modelling with constructive backpropagation, Neural Networks, 12, 1999,

pp. 707-716.

[11] S. Sjogaard, Generalization in cascade-correlation networks, Proceedings of the IEEE

signal processing workshop, 1992, pp. 59-68.

[12] J. Yang and V. Honavar, Experiments with cascade-correlation algorithm, Technical report

91-16, Department of Computer Science, Iowa State University, 1991.

[13] C. S. Squires and J. W. Shavlik, Experimental analysis of aspects of the cascade-correlation

learning architecture, Machine learning research group working paper 91-1, Computer

Science Department, University of Wisconsin-Madison, 1991.

[14] G. Auda, M. Kamel and H. Raafat, Modular neural network architectures for classification,

IEEE International Conference on Neural Networks, V2, 1996, pp. 1279-1284.

[15] R. A. Jacobs and M. I. Jordan et al., Adaptive mixtures of local experts, Neural

Computation, 3(1), 1991, pp. 79-87.

[16] P. Liang, Problem decomposition and subgoaling in artificial neural networks, IEEE

International Conference on Systems, Man and Cybernetics, Los Angeles, CA. 1990,

pp.178-181.

[17] S. G. Romaniuk and L. O. Hall, Divide and conquer neural networks, Neural Networks, V6,

1993, pp. 1105-1116.

[18] A. J. C. Sharkey, Modularity, combining and artificial neural nets, Connection Science,

9(1), 1997, pp. 3-10.

[19] D. E. Rumelhart, G. Hinton and R. Williams, Learning internal representations by error

propagation, In D. E. Rumelhart, and J. L. McClelland, Parallel Distributed Processing,

Vol. I Foundations, MIT Press, Cambridge, MA, 1986, pp. 318-362.

22

[20] M. Riedmiller and H. Braun, A direct adaptive method for faster backpropagation learning:

the RPROP algorithm, Proceedings of the IEEE International Conference on Neural

Networks, 1993, pp.586-591.

[21] S. E. Fahlman, An empirical study of learning speed in backpropagation networks,

Technical report, CMU-CS-88-162, Carnegie-Mellon University, 1988.

[22] T. Tollenaere, Supersab: Fast adaptive backpropagation with good scaling properties,

Neural Networks, 3(5), 1990, pp. 561-573.

[23] Jim. Torresen, and Olav. Landsvek, A review of parallel implementations of

backpropagation neural networks, In N. Sundararajan, and P. Saratchandran, Parallel

Architectures for Artificial Neural Networks: Paradigms and Implementations, IEEE

Computer Society Press, Los Alamitos, California, 1998, pp. 25-63.

[24] L. Prechelt, PROBEN1: A set of neural network benchmark problems and benchmarking

rules, Technical Report 21/94, Department of Informatics, University of Karlsruhe,

Germany, 1994.

[25] R. O. Duda, and P.E. Hart, Pattern Classification and Scene Analysis, New York:

Academic Express, 1973.

[26] S. Bernhard and J. Smola Alexander, Learning with kernels, Cambridge, Massachusetts:

The MIT Press, 2002.

[27] J. Lampinen and A. Vehtari, Bayesian approach for neural networks, Neural Networks,

April 2001, pp. 7-24.

[28] D. J. C. Mackay, A practical bayesian framework for backpropagation, Neural

Computation, Vol. 4, 1992, pp. 448-472.

[29] D. J. C. Mackay, Bayesian methods for adaptive models, Ph.D. Thesis, California Institute

of Technology, 1991.

output partitioning of neural networks final output partitioning...6 where wjk is the connection...

Documents