Instance Selection and Prototype-based Rules in RapidMiner

Marcin Blachnik 1, Mirosław Kordos 2

1 Silesian University of Technology, Department of Management and Informatics, Gliwice, Poland

2 University of Bielsko-Biala, Department of Mathematics and Computer Science, Bielsko-Biała, Poland

1 Introduction

Instance selection is an important preprocessing procedure which brings many benefits, but it is also a very challenging task. In the past, instance selection was mainly used to improve the accuracy of the nearest neighbor classifier or to speed up the decision making process. A problem that has become especially prominent in recent years is the huge amount of training data that needs to be processed. It entails very high computational complexity and poses a real challenge for most learning algorithms. Among the solutions to this problem, instance selection and construction algorithms deserve special attention. Like feature selection, they can be used to reduce the data size, in this case by constructing new samples or rejecting some of the existing samples from the training set, thus making the training process feasible.

Another application of instance selection can be found in prototype-based rules, where the selected instances help us understand the data properties or the decisions made by the model (e.g. a nearest neighbor model). In the case of prototype-based rules, the goal of the training process is to reduce the size of the data as much as possible and to keep the smallest possible set of representative examples without a significant loss of accuracy. The selected examples become representatives of the population, covering all the distinctive features of all class categories.

One example of prototype-based rules are prototype threshold rules, where each prototype is associated with a threshold, such that together they represent a receptive field. Prototype-based rules are typically constructed by sequential covering algorithms or by distance-based decision trees.

In this chapter we show how to use RapidMiner to implement instance selection algorithms. Examples of instance selection and construction methods are presented in the subsequent sections using popular datasets from the UCI repository [2]. The datasets are listed in Table 1.


The presented examples require the Instance Selection and Prototype-based Rule plugin (ISPR), which is available at the RapidMiner Marketplace.

Table 1: List of datasets used in the experiments

Dataset Name         # instances   # attributes   # classes
Iris                         150              4           3
Diabetes                     768              8           2
Heart Disease                297             13           2
Breast Cancer                699              9           2
SpamBase                    4601             57           2
Letter Recognition         20000             16          26

2 Instance Selection and Prototype-based Rule extension

The ISPR plugin is an extension to RapidMiner which can be installed from the RapidMiner Marketplace, where it can be found under the prules abbreviation. To get access to the Marketplace you need to modify the RapidMiner configuration settings by changing the update server address in Tools → Preferences → Update tab → rapidminer.update.url, setting it to http://rapidupdate.de:80/UpdateServer. After changing the configuration settings the installation process is trivial: you only have to check the appropriate checkbox in Help → Update RapidMiner...

If you are interested in more details, including the source files, SVN repository access and JavaDoc, go to the www.prules.org webpage.

Installation of the Instance Selection and Prototype-based Rule extension adds a new group of operators called ISPR. The group contains all operators implemented in the ISPR extension and includes the following subgroups:

• Classifiers. This subgroup contains all classification and regression operators. Currently there is only one operator, called My k-NN. It is an alternative implementation of the k-NN classifier in RapidMiner. The main difference is that it can operate on a modified distance metric, for example one including a covariance matrix calculated by some of the instance selection or construction algorithms.

• Optimizers. A group of supervised instance construction operators. This group also includes a single operator, called LVQ, which implements 7 different LVQ training algorithms.

• Selection. A group of instance selection operators, where each operator uses the nearest neighbor rule for instance (example) evaluation.

• Generalized selection. This group includes algorithms similar to the ones in the Selection group.


However, the operators are generalized such that they can handle nominal and numerical attributes, and any prediction operator can be used to evaluate the quality of instances.

• Feature selection. This is an extra plugin to the ISPR extension which wraps the Infosel++ [8] library for feature selection.

• Clustering. This group contains various clustering algorithms, like Fuzzy C-Means, Conditional Fuzzy C-Means and the vector quantization (VQ) clustering algorithm.

• Misc. The Misc group consists of several useful operators, like the class iterator, which iterates over the classes of an ExampleSet and e.g. allows for independent clustering of each class. Another operator allows for assigning class labels to prototypes obtained via clustering.

• Weighting. The last group includes operators for instance weighting. The obtained weights can be used for conditional clustering and the WLVQ algorithm.

The subsequent sections present examples of operators in each of the above groups.

3 Instance selection

As described in the introduction, there are several purposes of instance selection. One is to speed up the k-NN classification process for online prediction tasks. Another is to enable us to understand the data by selecting a small reference subset, as in Prototype-Based Rules [5]. In both cases the goal is to reduce the number of selected instances as much as possible without affecting the accuracy of the k-NN classifier. Yet another purpose is to remove outliers from the training set.

The instance selection group includes several operators. Each operator implements a single instance selection algorithm. All the operators in the group have a single input port (an ExampleSet) and three output ports: the set of selected instances, the original ExampleSet, and the trained k-NN prediction model. In most cases the model is equivalent to a process which starts by selecting instances and then trains the k-NN model. However, some of the instance selection algorithms calculate extra parameters, like a covariance matrix or local feature weights. Although the k-NN prediction model obtained from an instance selection operator could utilize these extra parameters, currently none of the implemented operators provide this functionality, so both solutions presented in Figure 1 are equivalent.


Figure 1: Equivalent applications of instance selection operators: (a) use of the internal k-NN model, (b) use of the selected instances to train an external k-NN model


Moreover, all of the operators provide internal parameters such as instances_after_selection, instances_before_selection and compression. These parameters can be used to assess the quality of instance selection.

3.1 Description of implemented algorithms

At the time of writing this chapter, 11 different algorithms are implemented. These are:

Random selection It is the simplest instance selection operator, which randomly draws instances from the example set. The instances can be selected independently or in a stratified way, such that the ratio of instances from different classes is maintained.

MC selection This is a simple extension of the random selection, also known as the Monte Carlo algorithm. It repeats the random selection a given number of times (the number of iterations) and selects the best subset. In this algorithm the quality of the set of selected instances is determined by the accuracy of the 1-NN classifier [10]. See Algorithm 1.

Algorithm 1 The MC algorithm
Require: T, z, maxit
  acc ← 0
  for i = 1 ... maxit do
    P* ← Rand(T, z)
    tacc ← Acc(1NN(P*, T))
    if tacc > acc then
      acc ← tacc
      P ← P*
    end if
  end for
  return P
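To make the procedure concrete, below is a minimal Python sketch of MC selection under the notation above (T split into a feature matrix X and labels y, z selected instances, maxit iterations). The function names are illustrative and are not part of the ISPR API.

import numpy as np

def nn1_accuracy(Px, Py, X, y):
    # accuracy of a 1-NN classifier with reference set (Px, Py) measured on (X, y)
    d = np.linalg.norm(X[:, None, :] - Px[None, :, :], axis=2)
    return np.mean(Py[np.argmin(d, axis=1)] == y)

def mc_selection(X, y, z, maxit, rng=np.random.default_rng(0)):
    best_idx, best_acc = None, -1.0
    for _ in range(maxit):
        idx = rng.choice(len(X), size=z, replace=False)  # random candidate subset
        acc = nn1_accuracy(X[idx], y[idx], X, y)         # evaluate it with 1-NN
        if acc > best_acc:                               # keep the best subset found
            best_acc, best_idx = acc, idx
    return best_idx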

RMHC selection This operator also belongs to the random-selection-based operators. It implements the Random Mutation Hill Climbing algorithm defined by Skalak [11]. The algorithm has two parameters: the number of prototypes and the number of iterations. It also uses the classification accuracy of the 1-NN classifier as a cost function. The algorithm is based on coding the prototypes as a binary string. To encode a single prototype, ⌈log₂(n)⌉ bits are required, where n is the number of vectors in the training example set; c·⌈log₂(n)⌉ bits are required to encode c prototypes. In each iteration the algorithm randomly mutates a single bit and validates the obtained set of instances. If the change improves the results, the new solution is kept for further evaluations; otherwise the change is rolled back. See Algorithm 2.

CNN selection CNN is one of the oldest instance selection algorithms. It doesn't have any parameters; it starts with a single randomly selected instance and then tries to classify all examples in the training set. If any instance is incorrectly classified, it is added to the selected set. The algorithm is rather fast and allows for high compression.


Algorithm 2 Sketch of the RMHC algorithm
Require: T, z, maxit
  m ← |T|
  st ← Bin(z·⌈log₂ m⌉)
  st* ← SetBit(st, z)
  acc ← 0
  for i = 1 ... maxit do
    P* ← GetProto(T, st*)
    tacc ← Acc(1NN(P*, T))
    if tacc > acc then
      acc ← tacc
      P ← P*
      st ← st*
    end if
    st* ← PermuteBit(st)
  end for
  return P
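The following Python sketch mirrors Algorithm 2, reusing nn1_accuracy from the MC sketch above. The bit-string encoding and the single-bit mutation follow the description of RMHC; all names are illustrative.

import math
import numpy as np

def rmhc_selection(X, y, z, maxit, rng=np.random.default_rng(0)):
    n = len(X)
    bits = math.ceil(math.log2(n))              # bits per encoded prototype
    code = rng.integers(0, 2, size=z * bits)    # random initial bit string
    def decode(c):
        # read z indices of ceil(log2 n) bits each; wrap codes that exceed n
        return [int("".join(map(str, c[i * bits:(i + 1) * bits])), 2) % n
                for i in range(z)]
    best = decode(code)
    best_acc = nn1_accuracy(X[best], y[best], X, y)
    for _ in range(maxit):
        cand = code.copy()
        cand[rng.integers(len(cand))] ^= 1      # mutate a single random bit
        idx = decode(cand)
        acc = nn1_accuracy(X[idx], y[idx], X, y)
        if acc > best_acc:                      # keep improving mutations only
            best_acc, code, best = acc, cand, idx
    return best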

The quality of the compression depends on the level of noise in the data. The CNN algorithm is shown in Algorithm 3, where T is the training set, xi is the i-th instance of the training set, P is the selected subset of examples, and pj is the j-th example of the set P.

Algorithm 3 The CNN algorithm
Require: T
  m ← |T|
  P ← {x₁}
  flag ← true
  while flag do
    flag ← false
    for i = 1 ... m do
      Ĉ(xi) ← kNN(P, xi)
      if Ĉ(xi) ≠ C(xi) then
        P ← P ∪ {xi};  T ← T \ {xi}
        flag ← true
      end if
    end for
  end while
  return P
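A minimal Python sketch of Algorithm 3 is given below; it starts from the first instance and keeps absorbing every example that the current subset misclassifies with the 1-NN rule. The names are illustrative.

import numpy as np

def cnn_selection(X, y):
    selected = [0]                                # start with a single instance
    changed = True
    while changed:                                # repeat until a full clean pass
        changed = False
        for i in range(len(X)):
            if i in selected:
                continue
            d = np.linalg.norm(X[selected] - X[i], axis=1)
            if y[selected][np.argmin(d)] != y[i]: # misclassified -> absorb it
                selected.append(i)
                changed = True
    return np.array(selected)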

ENN selection ENN is the base for a group of algorithms. It uses a single parameter k to determine the number of considered nearest neighbors. The algorithm iterates over all vectors in the example set and tries to predict their class labels. If the predicted label is incorrect, the vector is marked for removal. Its sketch is presented in Algorithm 4.


Algorithm 4 Sketch of the ENN algorithm
Require: T
  m ← |T|
  remi ← 0 for all i
  for i = 1 ... m do
    Ĉ(xi) ← kNN(T \ {xi}, xi)
    if Ĉ(xi) ≠ C(xi) then
      remi ← 1
    end if
  end for
  for i = 1 ... m do
    if remi == 1 then
      T ← T \ {xi}
    end if
  end for
  P ← T
  return P
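In Python the same procedure can be sketched as below: each instance is classified by its k nearest neighbours excluding itself, and only the correctly predicted ones are kept. Names are illustrative.

import numpy as np
from collections import Counter

def enn_selection(X, y, k=3):
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                 # leave-one-out: skip itself
        nbrs = np.argsort(d)[:k]
        vote = Counter(y[nbrs]).most_common(1)[0][0]  # neighbours' majority vote
        if vote == y[i]:                              # keep correctly predicted ones
            keep.append(i)
    return np.array(keep, dtype=int)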

RENN selection RENN simply repeats the ENN selection until no vector is removed (see Algorithm 5).

Algorithm 5 Sketch of the RENN algorithm
Require: T
  flag ← true
  while flag do
    flag ← false
    P ← ENN(T)
    if P ≠ T then
      flag ← true
      T ← P
    end if
  end while
  return P
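RENN then reduces to a loop over the ENN sketch above, repeated until no instance is removed:

import numpy as np

def renn_selection(X, y, k=3):
    idx = np.arange(len(X))
    while True:
        keep = enn_selection(X[idx], y[idx], k)  # ENN pass on the surviving set
        if len(keep) == len(idx):                # nothing removed -> stop
            return idx
        idx = idx[keep]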

All-kNN selection This operator also belongs to the family based on the ENN algorithm. It repeats the ENN procedure for a range of different k values.

GE selection is a Gabriel Editing proximity-graph-based algorithm. It uses a simple criterion, defined as

    ∀ a≠b≠c :  D²(pa, pb) > D²(pa, pc) + D²(pb, pc)    (1)

to determine whether an example pb is a neighbor of an example pa. If they are neighbors according to formula (1), and the class label of pa and of all its neighbors is the same, then pa is marked for removal.
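A brute-force Python sketch of this test (reading formula (1) as the condition that excludes pb from pa's neighbourhood) and of the resulting editing rule is shown below; it is O(n³) as written, and practical implementations are smarter. Names are illustrative.

import numpy as np

def gabriel_neighbours(P, a):
    D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)  # squared distances
    nbrs = []
    for b in range(len(P)):
        if b == a:
            continue
        others = (c for c in range(len(P)) if c not in (a, b))
        # b is a Gabriel neighbour of a when no c satisfies the exclusion test (1)
        if all(D2[a, b] <= D2[a, c] + D2[b, c] for c in others):
            nbrs.append(b)
    return nbrs

def ge_selection(P, labels):
    keep = []
    for a in range(len(P)):
        nbrs = gabriel_neighbours(P, a)
        # keep border points: those with at least one neighbour of another class
        if not nbrs or any(labels[n] != labels[a] for n in nbrs):
            keep.append(a)
    return keep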


RNG selection is an implementation of the Relative Neighbor Graph algorithm, which is very similar to GE selection. It differs in the function used to evaluate the neighbors, which is modified as follows:

    ∀ a≠b≠c :  D(pa, pb) ≥ max(D(pa, pc), D(pb, pc))    (2)

For more details see []
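In code, the only change with respect to the Gabriel sketch above is the neighbourhood condition, which can be swapped in as follows (here D holds plain, not squared, pairwise distances):

def rng_neighbour_test(D, a, b, others):
    # b is a relative neighbour of a when no c satisfies the exclusion test (2)
    return all(D[a, b] <= max(D[a, c], D[b, c]) for c in others)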

IB3 selection is a very popular algorithm defined by D. Aha [1] and is a member of the IBL (Instance-Based Learning) family of methods. The first two algorithms in the IBL family (IB1 and IB2) are just small modifications of 1-NN, introducing a normalization step and the rejection of samples with unknown attributes (IB1), and a simplified one-pass CNN algorithm (IB2). The IB3 algorithm extends this family by applying the "wait and see" principle, evaluating samples according to the following formula:

    AC_{up|low} = ( p + z²/(2n) ± z·√( p(1−p)/n + z²/(4n²) ) ) / ( 1 + z²/n )    (3)

where p denotes the probability of success in n trials and z is a confidence factor.

A prototype is accepted if the lower bound of its accuracy (with confidence level 0.9) is greater than the upper bound of the frequency of its class label, and a prototype is removed if the upper bound of its accuracy is lower (with confidence level 0.7) than the lower bound of the frequency of its class.
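The bound in Eq. (3) is a Wilson score interval, so the accept/reject test can be sketched in Python as below. The z-scores 1.645 and 1.036 correspond to the quoted 0.9 and 0.7 confidence levels (two-sided); treat the exact values as an assumption of this sketch.

import math

def wilson_bounds(p, n, z):
    # lower and upper confidence bounds for a success rate p observed in n trials
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n), (centre + spread) / (1 + z * z / n)

def ib3_accept(acc, n_acc, freq, n_freq):
    lo_acc, _ = wilson_bounds(acc, n_acc, z=1.645)     # accuracy, 0.9 confidence
    _, up_freq = wilson_bounds(freq, n_freq, z=1.645)  # class frequency
    return lo_acc > up_freq                            # accept the prototype

def ib3_reject(acc, n_acc, freq, n_freq):
    _, up_acc = wilson_bounds(acc, n_acc, z=1.036)     # accuracy, 0.7 confidence
    lo_freq, _ = wilson_bounds(freq, n_freq, z=1.036)
    return up_acc < lo_freq                            # drop the prototype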

ELH selection is an implementation of the instance selection algorithm proposed by Cameron-Jones [4]. The algorithm, based on the Encoding Length Heuristic (ELH), evaluates the influence of rejecting an instance. The heuristic is defined as:

    J(n, z, nerr) = F(z, n) + z·log₂(c) + F(nerr, n−z) + nerr·log₂(c−1)    (4)

where F(z, n) is the cost of coding n examples by z examples:

    F(l, n) = log*( Σ_{j=0..l}  n! / (j!(n−j)!) )    (5)

where log* is the cumulative sum of the positive iterates of log₂. The algorithm starts by rejecting an instance and then evaluates the influence of the rejection on the ELH heuristic. If the rejection does not decrease the value of the heuristic, the vector is removed and the whole procedure is repeated.
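Under this reading of log*, the heuristic of Eqs. (4)-(5) can be sketched in Python as:

import math

def log_star(x):
    total, v = 0.0, x
    while v > 1:                       # sum the positive iterates of log2
        v = math.log2(v)
        if v > 0:
            total += v
    return total

def F(l, n):
    # cost of coding n examples by l examples, Eq. (5)
    return log_star(sum(math.comb(n, j) for j in range(l + 1)))

def elh_cost(n, z, nerr, c):
    # n examples, z prototypes, nerr errors, c classes, Eq. (4)
    return (F(z, n) + z * math.log2(c)
            + F(nerr, n - z) + nerr * math.log2(max(c - 1, 1)))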

3.2 Speeding up 1-NN classification

To speed up the classification of the 1-NN algorithm, the family of proximity graph algorithms may be used.


This group of algorithms is represented by the GE Selection and RNG Selection operators. Both of them try to preserve a decision border identical to that of the original 1-NN classifier. The CNN, ELH and IB3 algorithms can also be used as members of this group. Because particular algorithms may produce different accuracy and different compression, it is worth comparing them. To compare the quality of different learning algorithms, the best practice is to use the X-Validation operator, as shown in Figure 2.

Figure 2: Validation of the accuracy and compression of the GE Selection operator

However, this example doesn't provide any information about the compression, which can be extracted from the instance selection operators using the Log operator. For large datasets, where the selection process may be time consuming, it is better to compare different instance selection techniques at once, in a single process. Such a model should also provide an identical testing environment for all instance selection operators. This can be accomplished by embedding all the instance selection algorithms into a validation loop, as shown in Figure 3.

Figure 3: Selecting the most useful instance selection operator: validation inner process configuration

In this example we start with explaining the validation procedure, which is equivalent to the one shown in Figure 1, except that instead of a single instance selection algorithm a family of different algorithms is used. The trained 1-NN models are then passed to the testing process, where each of them is applied and the performance of each is calculated. In the last step the Log operator is applied to log all the important values in each validation loop: it registers the accuracy and the compression. The logged values are then used further, outside of the Validation operator. Now let's see the main process, presented in Figure 4.


First the data is retrieved from the repository (here we use the Iris dataset for simplicity), and then all attributes are normalized using the Normalize operator (this operator can be found in Data Transformation → Value Modification → Numerical Value Modification). This step is very important, because after reading the dataset the attributes may be differently scaled, which affects the distance calculations. After that the Validation operator is executed.

Figure 4: Selecting the most useful instance selection operator: main process

When it finishes, we can access the extracted Log values. To perform further calculations, the logged values are transformed into an example set using the Log to Data operator (it requires defining the name of the log to transform into data; the log name is the name of the Log operator). This allows for further manipulation of the values. The logged values contain the accuracy and the compression recorded in each validation step, which should be averaged. For that purpose the Aggregate operator is used. The configuration settings of the Log and Aggregate operators are shown in Figure 5.

Figure 5: Selecting the most useful instance selection operator: (a) Log configuration, (b) Aggregate configuration

The results for some datasets are presented in Table 2. The results were obtained by extending the process with some extra operators which allow for automatic result acquisition. In this example the Loop Data Sets operator was used to iterate over several datasets, as shown in Figure 6(a). The inner process was equipped with a mechanism that allows combining results. This was achieved with the Recall and Remember operators: the first one retrieves the previous results and the second one stores the new results in a memory cache. (Note: when calling the Remember and Recall operators we have to specify the type of input to be stored; here we store an ExampleSet.) The old results and the new ones obtained at the output of the Aggregate operator are combined using the Append operator (see Figure 6(b)).


Table 2: Comparison of various instance selection methods. acc denotes classification accuracy, compr denotes compression

Dataset Name    kNN acc  GE acc  RNG acc  CNN acc  GE compr.  RNG compr.  CNN compr.
Iris              94.66   94.66    90.66    91.33      39.55       18.22       16.96
Diabetes          70.97   70.97    67.59    66.67      94.06       59.57       49.16
Heart Disease     77.90   77.90    76.94    73.61      99.89       57.86       44.29
Breast Cancer     94.85   94.70    92.13    92.71      38.07       12.82       13.54

Figure 6: Processing many datasets at once: (a) main process configuration, (b) combining results configuration, (c) handling exception configuration, (d) inner process of the Loop Data Sets operator

Since there are no initial results in the first run, the Recall operator will raise an exception. For that reason the Handle Exception operator calls the Catch subprocess in case of an exception, as shown in Figure 6(c).


3.3 Outlier elimination

While mining datasets we often encounter the problem of noise. Noisy examples may affect the quality of the learning process. Their influence usually depends on the cost function used by the learner and on the size of the data (for smaller datasets their influence is much higher than for larger ones). Some of the instance selection operators can be used to eliminate outliers, for example the ENN Selection operator and others in that family, like RENN Selection or All-kNN Selection.

To analyze the influence of noise on the model accuracy, a process in RapidMiner can be built. In the sample process we will consider the influence of different levels of noise on the accuracy of an LDA classifier applied to the Iris dataset. The main process is presented in Figure 7(a).

Figure 7: Test of the outlier elimination task: (a) the main process, (b) the subprocess with the Add Noise operator, (c) the subprocess of the Validation operator, (d) Loop Parameters configuration

The process starts with retrieving the Iris dataset from the repository; then the data is normalized using the Normalize operator (set to normalize to the range 0-1), followed by the Loop Parameters operator.


The final results are retrieved using the Recall operator, because they are stored in the memory cache. The subprocess of the Loop Parameters operator is presented in Figure 7(b). It is similar to the previous one, but the Add Noise operator is added. This operator allows adding noise to variables or labels; for our test we only add noise to the label attribute. The subprocess of the validation step is presented in Figure 7(c). It uses two LDA operators: the first one uses the dataset without noise rejection, and the second one uses the prototypes obtained from the ENN Selection operator; this second LDA is called Pruned LDA. The remaining step required to execute the process is the configuration of the Loop Parameters operator. For the experiment, the Add Noise operator has to be selected, and then the label_noise parameter from the list of parameters. We set the minimum level of noise to 0 and the maximum to 0.25, in 5 steps (see Figure 7(d)). The obtained results are presented in Table 3. From these results we can see that as the level of noise increases, the Pruned LDA improves the results; however, in the absence of noise or for low noise levels the accuracy may decrease.

Table 3: Comparison of the influence of noise on the accuracy with and without ENN noise filtering

Pruned LDA    LDA  Noise Level
     84.00  86.66         0.0
     81.33  80.00        0.05
     73.33  74.66        0.10
     64.00  62.66        0.15
     55.33  50.66        0.2
     45.33  42.66        0.25

3.4 Advances in instance selection

All the algorithms presented above were applied as a single instance selection step. In practice, instance selection algorithms can be combined into a series of data transformations. This is important from the perspective of prototype-based rules, but also when we are interested in high compression.

A good example of such processing is the combination of outlier elimination algorithms with algorithms preserving the 1-NN decision border (described in Section 3.2). The algorithms which preserve the 1-NN decision boundary are very sensitive to outliers and noisy examples, so it is desirable to prune these before applying the CNN, GE or RNG algorithms. An example of such an application is presented below. In this example we compare the accuracy and compression of three combinations of algorithms, but instead of the processes presented earlier we use the Select Subprocess operator, which switches between different instance selection schemes. In this scenario it is important to set Use local random seed for the Validation operator, which assures that each model gets identical training and testing datasets. The main process is presented in Figure 8(a). At the beginning all the datasets are loaded, and then the already presented Loop Data Sets operator is used.


Figure 8: Process configuration used to compare different instance selection scenarios: (a) main process configuration, (b) subprocess of the Loop Data Sets operator, (c) subprocess of the Validation operator, (d) configuration of the Select Subprocess operator

However, we add an extra operator called Set Macro to define a variable Dataset_Iterator which identifies the dataset used in the experiments. When the Loop Data Sets operator finishes, the log is transformed into an ExampleSet; then it is required to use the Parse Numbers operator, which converts nominal values into numerical ones. The need for the nominal-to-numerical conversion derives from the fact that RapidMiner stores macro values as nominals, even when they represent numbers, like the calculated compression. Then the ExampleSet is aggregated; however, in this example the Aggregate operator uses two attributes to group the results before averaging: Dataset_Iterator and Model. The final three operators are used to assign appropriate nominal values: the Numerical to Nominal operator transforms the DataSet and Model attributes into nominals, and two Map operators assign appropriate names to the numbers, e.g. the DataSet value 1 is mapped to Diabetes, the value 2 to Ionosphere, etc.

Inside the Loop Data Sets operator, first the value of the Dataset_Iterator macro is incremented using Generate Macro, and then the Loop operator is used to iterate over different combinations of instance selection chains (Figure 8(b)).


To implement this, the iteration macro is used and the number of iterations is set to the number of different instance selection configurations. The subprocess of the Loop operator includes just the X-Validation operator; its subprocess is presented in Figure 8(c). The main difficulty is the need to manually calculate the compression of the whole system. For that purpose the Extract Macro operator is used, which extracts the number of instances before and after instance selection; then Generate Macro is used to calculate the compression, which is passed to the testing part. The test subprocess is relatively simple, with just two extra Provide Macro operators. They are necessary to provide the macro values in a form that can be recorded in the log. Here the performance, the compression, the iteration of the Loop operator (equivalent to the select_which parameter of the Select Subprocess operator) and the value of the Dataset_Iterator macro are logged. The last important part of the process is the Select Subprocess operator, which is visualized in Figure 8(d). It shows the different configuration schemes: in the first one only the CNN algorithm is used, in the second one it is preceded by ENN selection, while in the last one the two earlier operators are followed by RNG selection.

Figure 9: Comparison of three different combinations of instance selection algorithms (CNN, ENN+CNN, ENN+CNN+RNG): accuracy as a function of compression

The results obtained for several datasets are presented visually in Figure 9, using the Advanced Charts data visualization.

4 Prototype construction methods

The family of prototype construction methods includes all algorithms that produce a set of instances at the output. The family contains all prototype-based clustering methods, like k-means, Fuzzy C-Means (FCM) or Vector Quantization (VQ). It also includes the Learning Vector Quantization (LVQ) set of algorithms. The main difference between these algorithms and the instance selection ones is the way of obtaining the resulting set.


In instance selection algorithms the resulting set is a subset of the original instances, while in instance construction methods the resulting set consists of newly constructed instances.

Now we would like to show different approaches to prototype construction. In our examples we will use them to train a k-NN classifier, however the obtained instances can also be used to train other models. According to the paper of L. Kuncheva [9], there are many approaches to using clustering algorithms to obtain a labeled dataset. The easiest one is to cluster the dataset and then label the obtained prototypes. This kind of process is shown in Figure 10. It starts with loading the dataset; then the data is clustered using the FCM algorithm. The obtained cluster centers are then labeled using the Class Assigner operator. After relabeling, the prototypes are used to train a k-NN model, and the model is applied to the input dataset.

Figure 10: Process of prototype construction based on clustering and relabeling
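The same cluster-then-relabel idea can be sketched in a few lines of Python, here with scikit-learn's KMeans standing in for FCM (an illustrative substitution, not the ISPR implementation); each prototype inherits the majority class of the examples assigned to its cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def cluster_and_relabel(X, y, n_prototypes=10, seed=0):
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=seed).fit(X)
    proto_y = np.array([np.bincount(y[km.labels_ == c]).argmax()  # majority label
                        for c in range(n_prototypes)])            # (y: small non-negative ints)
    return km.cluster_centers_, proto_y

# the relabelled prototypes then train an ordinary 1-NN model:
# P, Py = cluster_and_relabel(X, y)
# knn = KNeighborsClassifier(n_neighbors=1).fit(P, Py)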

Another approach is to cluster each class independently. In this approach we need to iterate over the classes, and in each iteration we select the instances of a single class label, which are then clustered. The prototypes obtained after relabeling are concatenated. This example is shown in Figure 11(a).

Figure 11: Process of prototype construction based on clustering each class and combining the obtained cluster centres: (a) main process configuration, (b) subprocess of the Class Iterator

In this example the Class Iterator is used to iterate over the classes. Instances from each single class are clustered and delivered to the inner process in a loop (Figure 11(b)). At the end, the obtained prototypes are automatically combined into the final set of prototypes.
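A sketch of this per-class variant, reusing the k-means substitution of the previous example: each class is clustered separately and the centres, which are correctly labelled by construction, are concatenated.

import numpy as np
from sklearn.cluster import KMeans

def per_class_prototypes(X, y, per_class=5, seed=0):
    centres, labels = [], []
    for cls in np.unique(y):
        km = KMeans(n_clusters=per_class, n_init=10, random_state=seed)
        km.fit(X[y == cls])                  # cluster one class at a time
        centres.append(km.cluster_centers_)
        labels.extend([cls] * per_class)     # centres inherit the class label
    return np.vstack(centres), np.array(labels)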

In the next example an LVQ network will be used to select the prototypes.


The LVQ algorithm has many advantages, and the most important one is the ability to optimize the codebooks (prototypes) according to the classification performance. The use of the LVQ network is equivalent to the previous examples, except that the LVQ network requires initialization. For the initialization, any instance selection or construction algorithm can be used. The typical initialization is based on random codebook selection, so the Random Selection operator can be used, as shown in Figure 12(a).

Figure 12: Process of prototype construction based on the LVQ algorithm: (a) main process configuration, (b) random initialization of the LVQ operator
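For reference, the basic LVQ1 update behind these operators can be sketched as follows: the winning codebook is pulled towards a sample of its own class and pushed away otherwise. The learning rate schedule is illustrative.

import numpy as np

def lvq1(X, y, codebooks, cb_labels, lr=0.05, epochs=30):
    C = codebooks.copy()                                   # codebooks: float array
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w = np.argmin(np.linalg.norm(C - xi, axis=1))  # winning codebook
            sign = 1.0 if cb_labels[w] == yi else -1.0     # attract or repel
            C[w] += sign * lr * (xi - C[w])
        lr *= 0.9                                          # decay the learning rate
    return C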

In a more complex scenario we can use a series of LVQ processing steps, with prototypes initialized using VQ clustering. The tree structure of this process is shown in Figure 13.

Figure 13: Process tree of prototype construction based on the LVQ algorithm

The process starts with loading the data, and then the Validation operator is executed to estimate the accuracy of the model. In the training part of the Validation, the k-NN operator uses the prototypes obtained with the LVQ2.1 algorithm. The obtained prototypes are filtered by the GE Selection algorithm, because some of the determined codebooks may be useless. The LVQ2.1 algorithm is initialized by the prototypes obtained from the LVQ1 algorithm.


The LVQ1 algorithm, in turn, is initialized by the prototypes from the Class Iterator, where the vectors from each class are clustered using the VQ algorithm starting at random. The number of prototypes is always determined by the initialization operator. So if the Random Selection is set to determine 10 prototypes and the problem is a three-class classification task, the resulting set of codebooks obtained by the LVQ algorithm will include 30 prototypes.

In the last example we use the Weighted LVQ algorithm [3], which takes the example weights into account during training. The weights can be determined in different ways, but if we are interested in the model accuracy we can use the ENN Weighting operator, which assigns weights according to the mutual relations between the classes. This operator generates a new weight attribute, which is used by the Weighted LVQ algorithm. The weighting scheme assigns weights according to the confidence of the ENN algorithm, which is also used for instance selection. The instance weight takes a value close to 1 when all neighbor instances are from the same class, and small values for instances which lie on the border. The weights should then be transformed with the Weight Transformation operator, for example using the formula exp(−pow(weight − 0.5, 2) ∗ 0.5).

Figure 14: Process tree of prototype construction based on the LVQ algorithm

A comparison of the WLVQ, LVQ2.1 and LVQ1 algorithms is provided for the SpamBase dataset, using the process in Figure 14. In this example, in order to speed up the process, the partial results are stored in a cache using the Remember operator and retrieved with the Recall operator. The partial results can be used for the initialization of the LVQ algorithms. To analyze the process, we focus on the training subprocess of the Validation step. First the codebook initialization is calculated using the Class Iterator operator and the VQ clustering algorithm. The obtained results are cached using the Remember operator; then the training with the LVQ algorithms begins. The first one starts with determining the weights with ENN Weighting; then the weights are transformed, as discussed earlier, by the Weight Transformation operator. The training set prepared in this way is used to train the WLVQ model; the initialization of the WLVQ algorithm is carried out with the Recall operator, using the previously stored values. The final k-NN model is obtained directly from the LVQ output ports. The next LVQ1 model is trained identically, and its k-NN model is delivered to the output of the subprocess. Moreover, the obtained prototypes are also stored in the cache for the initialization of the LVQ2.1 algorithm.
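The weighting step can be sketched in Python as below: an ENN-style confidence (the fraction of the k neighbours sharing the instance's class) followed by the transformation quoted above. The choice of k is illustrative.

import numpy as np

def enn_weights(X, y, k=5):
    w = np.empty(len(X))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        w[i] = np.mean(y[nbrs] == y[i])      # ~1 inside a class, small on borders
    return np.exp(-((w - 0.5) ** 2) * 0.5)   # the transformation from the text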

5 Regression Problems

So far we have adjusted three instance selection algorithms for regression tasks (ENN, CNN and CA), and in a future version of the ISPR extension even more algorithms will be able to deal with regression problems. The "correct classification" used in classification problems is replaced here with "prediction error lower than some threshold".


The threshold maxdY expresses the maximum difference between the output values of two vectors to consider them similar, and can be either constant or variable. In the latter case it is proportional to the standard deviation of the output among the k nearest neighbors of the given vector, which reflects the speed of changes of the output around that point and allows dynamically adjusting the threshold to the local landscape, which, as the experiments showed, allows for better compression of the dataset. In the CA algorithm, the "nearest instances of the same class" used in classification are replaced by the nearest instances according to a weighted sum of the distances in the input and output (label) spaces.

You create the processes for regression tasks in exactly the same way as for classification tasks. The only difference is that in the operator properties you can additionally adjust the properties described above (thresholds, distance measures).
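A sketch of the regression variant of ENN under the rule described above: an instance is kept when the k-NN prediction differs from its target by less than maxdY, which can optionally scale with the local spread of the neighbours' outputs. Parameter values are illustrative.

import numpy as np

def enn_regression(X, y, k=5, maxdY=0.1, adaptive=False):
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        thr = maxdY * np.std(y[nbrs]) if adaptive else maxdY  # variable threshold
        if abs(np.mean(y[nbrs]) - y[i]) < thr:                # outputs similar enough
            keep.append(i)
    return np.array(keep, dtype=int)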

6 Mining Large Data Sets

Nowadays more and more data is logged into databases, and mining such large datasets becomes a real challenge. Most prediction models have polynomial (quadratic or cubic) computational complexity, for example similarity-based methods like k-NN, SVM, etc. So a question arises: how to deal with such large datasets and how to process them on current computers? One of the solutions is to distribute the mining process between many computing nodes, as in Mahout, a subproject of Apache Hadoop; we can take advantage of it using the Radoop extension to RapidMiner. Another approach is to use updateable models, which update the model parameters with each newly presented sample, so they scale linearly with the size of the dataset; a disadvantage of updateable models is usually their poor generalization ability. Perhaps the simplest solution is resampling the data to a given number of samples, but then we may lose some important information. The last method is similar to resampling, but the samples are not random; instead, they are obtained with instance selection and construction methods.

A state-of-the-art approach is based on initial clustering of the data, for example using the k-means algorithm. In this approach the prediction model is trained on the cluster centers instead of the original data. However, this is also not the best solution. Clustering does not take advantage of the label information, so examples which lie close to the decision border may be grouped into a single cluster. Also, vectors which are far from the decision border are less important for the prediction algorithms, carrying useless information. These instances should be pruned, but clustering algorithms cannot do this, and they usually produce many prototypes that are useless from the prediction model's perspective, reducing the information gain of the pruned datasets.

Instance selection algorithms are dedicated to pruning useless instances using the label information, so the ISPR extension can be very useful for mining big datasets. The idea of using instance selection methods to train prediction models on the preselected instances was already presented in [7, 6], but there the aim was to improve the quality of the learning process. Now we can also use them as filters, similarly to feature selection filters, to prune some of the useless data, thus reducing the size of the data and filtering out noise.
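As a minimal illustration of this filtering idea, the ENN and CNN sketches from Section 3.1 can be chained in front of an expensive learner (here scikit-learn's SVC); this mirrors the ENN+CNN+SVM scheme evaluated below, not the exact RapidMiner process.

from sklearn.svm import SVC

def filtered_svm(X, y):
    idx = enn_selection(X, y, k=3)        # drop noisy examples first
    Xf, yf = X[idx], y[idx]
    idx2 = cnn_selection(Xf, yf)          # then condense to the border region
    return SVC().fit(Xf[idx2], yf[idx2])  # train only on the selected subset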

The only drawback of the instance selection methods is their computational complexity, which in most cases is quadratic, so the gain of using them comes afterwards, from precisely optimizing the prediction model.


Sample results are presented in Figure 16, where a dataset containing over 20,000 instances is used to show the dependencies between the computational complexity and the obtained accuracy.

Figure 15: Process used to evaluate the time complexity of the instance selection methods: (a) main process, (b) Loop Parameters subprocess, (c) Validation subprocess, (d) Select Subprocess operator configuration

In the process presented in Figure 15(a), first the dataset is loaded into memory and then normalized to fit the range [0, 1]. Then the main Loop Parameters loop starts and sets the number of examples sampled from the dataset. Inside the operator (Figure 15(b)), first the data is sampled using the Sample (stratified) operator; the number of samples is set by Loop Parameters. Then the subset is rebuilt in memory using the Materialize Data operator, and the final loop iterates over the different combinations of models. The models are evaluated with the X-Validation operator. Inside the validation loop (Figure 15(c)), the Select Subprocess operator is used to switch between the different combinations of models, which are shown in Figure 15(d).


The evaluated algorithms include a pure SVM (in all cases, for simplicity, the LibSVM implementation was used with the default settings), CNN instance selection followed by the SVM, a combination of ENN and CNN as a pruning step before applying the SVM model, and finally a pure ENN algorithm as a filter for the SVM model. In the testing subprocess of the validation the obtained model was applied and the performance was measured. The Log operator was used to measure the execution time of the whole training process, including both example filtering and training of the SVM model, and the execution time of the Apply Model operator. The size of the dataset and the identifier of the selected processing scheme were also logged. Finally, after all the calculations finished, in the main process the logged values were converted into a dataset and averaged according to the processing scheme and the size of the data using the Aggregate operator. To use the Advanced Charts visualization of the ExampleSet, we need to convert the processing scheme identifier, which is a number, into a nominal value using the Numerical to Polynomial operator, and finally map the obtained values to appropriate names.

From the results we can see that the preprocessing based on the CNN algorithm has the lowest computational complexity, but unfortunately also the lowest accuracy. From the perspective of the quality of the prediction model, the combination of ENN+CNN leads to the best accuracy, but it also has the highest time complexity. The ENN preprocessing has a complexity similar to the original SVM, but the average accuracy is a little higher than that of the pure SVM model. In many applications the prediction time is also very important. The CNN and the ENN+CNN combination decrease that time significantly. This is due to the high selection ratio of both these algorithms, which results in a much lower number of support vectors needed to define the separating hyperplane.

A disadvantage of the instance selection methods is their computational complexity. This weakness is being addressed in the development of the future release of the ISPR extension, but it can also be overcome by instance construction methods. As already discussed, clustering algorithms are not a good solution for classification tasks, but the LVQ algorithm is: it has linear complexity as a function of the number of samples and attributes for a fixed number of codebooks, and it takes advantage of the label information. In place of the instance selection algorithms, the LVQ operator was used here; the comparison of the preprocessing based on LVQ and on k-means clustering is shown in Figure 17. The results presented in Figure 18 show linear time complexity for both the LVQ and the k-means algorithm, while the pure SVM scales quadratically. Similarly, the prediction time is linear or even constant, due to the fixed number of prototypes obtained from both prototype construction methods. The plot showing the accuracy as a function of the size of the data visualizes the gain of using LVQ instance construction over clustering-based dataset size reduction.

7 Summary

In this chapter we have shown usage examples of the instance selection and construction methods which can be obtained from the ISPR extension.


Figure 16: Accuracy (a), training time (b) and testing time (c) as a function of the sample size, using different instance selection methods followed by the SVM


Figure 17: Select Subprocess operator configuration used to evaluate the time complexity of the instance construction methods

Figure 18: Accuracy (a), training time (b) and testing time (c) as a function of the sample size, using the SVM and two instance construction algorithms, LVQ and k-means

The presented examples also show some advanced techniques of building RapidMiner processes, which can be used as a baseline to evaluate the quality of these and other prediction methods. We hope that this will be useful for users involved in distance- or similarity-based learning methods, but also for those who are interested in noise filtering. In this extension we have adapted some of the old algorithms to meet the requirements of regression problems. The set of available operators can also be useful for mining large datasets, transforming big data into small data. This makes mining large datasets with state-of-the-art techniques feasible. Most of the available operators implement algorithms which were not designed to minimize the computational complexity, so the challenge is to create even better and more efficient instance filters. We plan to constantly develop the ISPR extension and release future versions.

References

[1] D. Aha, D. Kibler, and M.K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66, 1991.

[2] A. Asuncion and D.J. Newman. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.

[3] M. Blachnik and W. Duch. LVQ algorithm with instance weighting for generation of prototype-based rules. Neural Networks, 2011.

[4] R. Cameron-Jones. Instance selection by encoding length heuristic with random mutation hill climbing. In Proc. of the Eighth Australian Joint Conference on Artificial Intelligence, pages 99–106, 1995.

[5] W. Duch and K. Grudzinski. Prototype based rules - new way to understand the data. In IEEE International Joint Conference on Neural Networks, pages 1858–1863, Washington D.C., 2001. IEEE Press.

[6] M. Grochowski and N. Jankowski. Comparison of instance selection algorithms. II. Results and comments. Lecture Notes in Computer Science, 3070:580–585, 2004.

[7] N. Jankowski and M. Grochowski. Comparison of instance selection algorithms. I. Algorithms survey. Lecture Notes in Computer Science, 3070:598–603, 2004.


[8] A. Kachel, W. Duch, M. Blachnik, and J. Biesiada. Infosel++: Information based feature selection C++ library. Lecture Notes in Computer Science, (in print), 2010.

[9] L. Kuncheva and J.C. Bezdek. Presupervised and postsupervised prototype classifier design. IEEE Transactions on Neural Networks, 10(5):1142–1152, 1999.

[10] L.I. Kuncheva. Fuzzy Classifier Design. Physica-Verlag, 2000.

[11] D.B. Skalak. Prototype selection for composite nearest neighbor classifiers. PhD thesis, University of Massachusetts Amherst, 1997.
