keelreferencemanualv1.0

8/13/2019 KeelReferenceManualV1.0

1/32

Knowledge Extraction based on

Evolutionary Learning

Reference manual

Version: 1.0.Date: 20-5-2009.


2/32

CONTENTS CONTENTS

Contents

1 Basic KEEL developement guidelines 31.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Developing a new method . . . . . . . . . . . . . . . . . . . . 3

1.2.1 Reading of the conguration le . . . . . . . . . . . . 41.2.2 Developement of the method . . . . . . . . . . . . . . 51.2.3 Writing the output les . . . . . . . . . . . . . . . . . 51.2.4 Registering the method in KEEL . . . . . . . . . . . . 51.2.5 Making the use case les . . . . . . . . . . . . . . . . . 8

1.2.6 Building the executables . . . . . . . . . . . . . . . . . 9

2 Method Description les 102.1 Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.3 Example of use . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Method Conguration les 133.1 Input les . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Output les . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.3 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.4 Example of use . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Data les 154.1 Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.3 Example of use . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Output les 18

5.1 Example of use . . . . . . . . . . . . . . . . . . . . . . . . . . 18

6 Use Case les 206.1 Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

KEEL - Reference Manual Page 1 of 31


3/32

CONTENTS CONTENTS

6.2 Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216.3 General description . . . . . . . . . . . . . . . . . . . . . . . . 216.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226.5 Example of use . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7 API Dataset 247.1 Data les grammar . . . . . . . . . . . . . . . . . . . . . . . . 247.2 Semantic restrictions of the data les . . . . . . . . . . . . . . 26

7.2.1 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . 267.2.2 Inputs and outputs denition . . . . . . . . . . . . . . 26

7.2.3 Missing values . . . . . . . . . . . . . . . . . . . . . . 277.2.4 Train and test les . . . . . . . . . . . . . . . . . . . . 27

7.3 Description of the classes . . . . . . . . . . . . . . . . . . . . . 287.3.1 InstanceSet . . . . . . . . . . . . . . . . . . . . . . . . 287.3.2 Instance . . . . . . . . . . . . . . . . . . . . . . . . . . 287.3.3 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . 307.3.4 Attribute . . . . . . . . . . . . . . . . . . . . . . . . . . 30



4/32

Basic KEEL developement guidelines

1 Basic KEEL developement guidelines

1.1 Introduction

The purpose of this document is to describe some basic concepts about thestructure of KEEL (Knowledge Extraction based on Evolutionary Learning)and the format of its internal les.

The aim of this section is to present the KEEL framework, describingsome guidelines to help a potential developer to build new methods insidethe KEEL environment. The next sections will deal with the formats of theconguration les of KEEL (including data sets les, method descriptionsand so on). Finally, the last section describes the API dataset of KEEL, whichis used to handle and check the data sets les.

1.2 Developing a new method

Before to start the task of developing a new method inside of KEEL envi-ronment, some operations have to be performed in order to fully integrateit. By following this guidelines, a developer can left all the input/outputoperations to be accomplished by KEEL environment, focusing its efforts inthe construction of the method itself.

The steps needed to complete the integration of a new method in KEELare:

Reading of the conguration le.

Development of the method.

Writing the output les.

Registering the method in KEEL.

Making the use case les.

Building the executables of the method.



5/32

Basic KEEL developement guidelines Developing a new method

1.2.1 Reading of the conguration le

The KEEL methods only accept one parameter: The name and path of aconguration le. A typical main class of a method can be the following:

p ub li c c l a s s Main {

p ri va te s t a t i c Clas c l a s s i f i e r ;

p ub li c s t a t i c void main ( S t r i ng a rgs [ ] ) {

i f ( a rg s . l ength != 1 ) {

System . err . p r i nt ln ( Error . ) ;} e l s e {c l a s s i f i e r = new Clas ( a rgs [ 0 ] ) ;c l a s s i f i e r . execu te ( ) ;

}

} / / end method} / / end c l a s s

The conguration le contains information about the input and outputles of the method. In addition, it contains the values for all the parametersdened. A full description of the conguration les can be found in section3.

By interpreting this le, the method should be able to acquire the correctvalues of its parameters, including the seed to initialize the random numbergenerator if the method needs it.

Also, the names and paths of the input and output les are specied inside.Usually, a KEEL method employs two input data les: The training le,containing the data set which should be employed in the train phase of themethod, and the test le, containing the data set which should be employedin the test phase. In addition, any method excepting the preprocessingmethods and the test methods specify a third le, the validation le. This

le contains a copy of the original dataset of the experiment, which can beused in comparisons with the train data.The format of the data les is explained in section 4. These les must be

handled with care, because they will be employed not only by the method,



6/32


but also by the KEEL API dataset (see section 7) in order to load and checkthe data in an efcient way.

Any KEEL method must dene at least two output les: A train outputle and a test output le. In addition, it is possible to dene additionaloutput les in the conguration le. They will be explained in the nextsubsections of this guide.

1.2.2 Developement of the method

The development of the method can be done in any programming environ-ment. The only requirements are: The method must be developed with the Java programming language, and it must employ a package structure whoseroot will be the keel/src directory, where the sources of any KEEL methodare located.

1.2.3 Writing the output les

As is explained before, at least two output les must be produced by themethod (the train output le, and the test output le). Its format is describedin section 5.

If it is desired to employ additional output les, they also can be createdat the end of the execution on the method. These additional les will get

its name from the conguration le. Also, it is important to note that, inorder to let the KEEL GUI automatically generate the names of these les,the number of additional outputs of the methods must be placed in thecorresponding method description le.

1.2.4 Registering the method in KEEL

When the method have been fully coded, it must be registered in the KEELconguration les, to allow the KEEL GUI to employ the new method.

The rst step is to create a method description le. The format of theseles is fully described in section 2.

The second step involves modifying the master description le of eachcategory method. Currently, 11 categories are dened:

Discretization


7/32


Educational Methods

Educational Preprocess

Feature Selection

Instance Selection

Method

Postprocess

Preprocess

Tests

TransOthers

Visualize

When the correct master description le have been found (please, ask to aKEEL project manager if it is not clear what le have to be modied), a newregistry containing the denition of the method must be created. The KEELmaster description le registers have the following structure:

< method >HeaderInputOutput

< / method >

The header is composed by four nodes:

< name > Disc UniformWidth < / name >< family > D i s c r e t i z e r s < / f amily >< j a r f i l e > Disc UniformWidth . j a r < / j a r f i l e >< problem type > unspec i f ied < / proble m type >

Name: The name of the method.

Family: The category of the method.

Jar File: The name of the Jar le which contains the method.

8/32


Problem Type: The class of problems which can manage the method. Thereare dened 4 classes:

Classication , for supervised classication problems. Regression , for regression problems. Unsupervised , for unsupervised classication problems (e.g. clus-

tering). Unspecied , for any problem (supervised classication, unsu-

pervised classication or regression).

The input and output parts denes the types of data which the method

is able to manage, both in input data and output data. Their elds mustspecify which types are allowed, by employing yes and no keywords.A description of the elds is shown as follows:

< continuous > Yes< / continuous > Yes< / in t ege r >< nominal > Yes< / nominal >< missing > Yes< / miss ing >< imprecise > No< m ul t i c l a s s > Yes< / m ul t i c l a ss >< multioutput > No< / multioutput >

Continuous: The method is able to work with continuous values.

Integer: The method is able to work with integer values.

Nominal: The method is able to work with nominal values.

Missing: The method is able to handle missing values.

Imprecise Value: The method is able to work with imprecise values.

Multiclass: The method is able to work with problem which denes morethan 2 classes.

Multioutput: The method is able to work with data which denes morethan 1 output for each instance.

9/32


When the header, input and output sections were completely dened, thenthe new registry can be place inside the corresponding master descriptionle. Below is shown a valid example of registry:

< method >< name > Disc UniformWidth < / name >< family > D i s c r e t i z e r s < / f amily >< j a r f i l e > Disc UniformWidth . j a r < / j a r f i l e >< problem type > unspec i f ied < / prob lem type >< input >

< continuous > Yes< / continuous > Yes

Yes

< missing > Yes< / miss ing >< imprecise > No< mul t i c l a s s > Yes< / mu l t i c l a ss >

< multioutput > No< / multioutput >< / input >< output >

< continuous > No< / continuous > No< nominal > Yes< / nominal >< missing > Yes< / miss ing >< imprecise > No< mul t i c l a s s > Yes< / mu l t i c l a ss >

< multioutput > No< / multioutput >< / output >

< / method >

1.2.5 Making the use case les

When developing a new method, it is important to document properly itsfunctions and objectives. Also, the users should be able to look up relevantinformation about the method (a brief description, some references, thedescription of its parameters, etc.) when they select the method in KEEL.

To manage this information, the KEEL GUI denes the use case les,which are XML les containing all the relevant information needed to em-ploy any KEEL method. A full description of the use case les can be foundin section 6.

10/32


1.2.6 Building the executables

When the method was fully developed, and its relevant conguration leshave been created, the last step is to add it to the build.xml le (a ANT scriptle), so the new versions of KEEL could be able to build it inside the KEELenvironment. The build.xml is a critical le, so it is not recommended tomodify it without authorization of a KEEL project manager.

The build.xml changes dynamically with any new version of KEEL, thusits is not possible to fully describe its structure here. However, it is possibleto describe which part of the le must be changed to allow the inclusion of new methods.

Firstly, the jar target must be found. It should have the following struc-ture:

< ta rg et name=j ar depends=compiled e sc r i p t i on=Bu ild j a r s >

The jar target is composite by a great number of tasks, every one dealingwith the construction of a jar le for each method. Inside this target, theconstruction of the new jar le must be described as another task. Here is avalid example:

< j a r

j a r f i l e = $ { distMet } /KNN. j a r manifest=$ { s r c } / k eel / Algor ith ms / Lazy Learning /KNN/ Manifest >

< f i l e s e t d ir =$ {bin } i nclud es= ke el / Algo rithms / Laz y Le arnin g /KNN/ / . c las s / >

< f i l e s e t d ir =$ {bin }includ es=keel / Algorithms / Prepr ocess / Basic / / . c l a s s

org / co re / / . c l a s skeel / Data set / / . c l a s skeel / Algorithms / Lazy Learning / . c las s / >

< / j a r >

The task must dene the locations of the new jar le and their corre-sponding manifest le. Also, must include the les from the classes whichcompose the method. Also, the les from the imported classes are requiredto fully describing the task.

11/32

Method Description les

2 Method Description les

Every method in KEEL (e.g., a preprocess method, a test . . . ) has assigned aXML le which describes its main characteristics. This le will be employed by the KEEL GUI to allow the user to select the values of the parameters of any execution of the method.

The KEEL Method Description les are located under the ../dist/algorithmdirectory, inside of the folder where its associated .JAR le is generated(e.g., ../dist/algorithm/methods ). Each Method Description le is an XMLcomposed by a unique root node, < algorithm specication > . This nodeis divided into two parts:

< a l g o r i t h m s p e c i f i c at i o n >HeaderParameters

< / a l go r i t h m sp e c i f i c a t i on >

Header : Basic information about the method.

Parameters : A list of parameters of the method.

2.1 Header

The header is composed by four nodes:

< name > K N e ar es t N eig hbo rs C l a s s i f i e r < / name >< nParameters > 2< / nParameters >< seed > 0< / se ed >< nOutput > 1< / nOutput >

< Name > : The name of the method.

< nParameters > : The number of parameters of the methods (must be0 or highger). Seed values employed to initialize random numbergenerators are not counted here.

< Seed > : Denes if the method will need a seed to initialize a randomnumber generator. Valid values are 1, if a seed is needed, or 0, if not.

12/32

Method Description les Parameters

< nOutput > : The number of additional output les which will be gener-ated by the method.

2.2 Parameters

The parameters of the method are listed consecutively. A < parameter >node is employed to describe each one. Each < parameter > is composed bythe following nodes:

< parameter >< name > K Value < / name >< type > i n t e g e r < / type >< domain >

< lowerB > 1< / lowerB >< upperB > 100< / upperB >

< / domain >< defau l t > 1< / de f au l t >

< / parameter >

< Name > : The name of the parameter.

< Type > : Type of parameter. KEEL denes four valid types:

integer : An integer value. Can be positive, 0, or negative.real : A real value. The dot . is employed as decimal separator.text : A string of text.list : A predened list of text options

When employing text parameters, no checking operations are done by the KEEL GUI. Thus, the use of list parameters is recommendedwhen a xed number of text options are dened, so the method doesnot have to check the parameters by itself.

< Domain > : The domain of the parameter. For list parameters is manda-

tory. For text

parameters cannot be dened. For integer

and real

parameters is optional (if it is not dened, the KEEL GUI will notcheck its value).

< lowerB > : The lower value of the parameter (valid only in integerand real parameters).

13/32

Method Description les Example of use

< upperB > : The highest value of the parameter (valid only in integerand real parameters).

< item > : A text value for the parameter (it can be employed only inlist parameters).

< Default > : Default value of the parameter.

2.3 Example of use

This is a valid example of a Method Description le:

< a l g o r i t h m s p e c i f i c at i o n >< name > K N e ar es t N eig hbo rs C l a s s i f i e r < / name >< nParameters > 2< / nParameters >< seed > 0< / se ed >< nOutput > 1< / nOutput >< parameter >

< name > K Value < / name >< type > i n t e g e r < / type >< domain >

< lowerB > 1< / lowerB >< upperB > 100< / upperB >

< / domain >< defau l t > 1< / de f au l t >

< / parameter >< parameter >

< name > Dis tance Funct ion < / name >< type > l i s t< / type >< domain >

< item > Manhattan < / it em >< item > Euclidean < / it em >

< / domain >< defau l t > Euclidean < / d e f au l t >

< / parameter >

< / a l go r i t h m sp e c i f i c a t i on >

14/32

Method Conguration les

3 Method Conguration les

In KEEL, every method uses a conguration le to extract the values of the parameters which will be employed during its execution. Althoughit is generated automatically by the KEEL GUI (by using the informationcontained in the corresponding method description le, and the values of the parameters specied by the user), it is important to fully describe itsstructure because any KEEL method must be able to completely read it, inorder to get the values of its parameters specied in each execution.

Each conguration le has the following structure:

algorithm : Name of the method.

inputData : A list with the input data les of the method.

outputData : A list with the output data les of the method.

parameters : A list of parameters of the method, containing the name of each parameter and its value (one line is employed to each one).

3.1 Input les

The les in the inputData list must be separated by one space, being eachone surrounded by quotation marks (). If a validation le is employed bythe method, the les will appear in the following order:

input data = < training le > < validation le > < test le >

If not, the order employed will be:

input data = < training le > < test le >

The validation le is a copy of the original train data employed at the startof the experiment. It is often employed for comparison tasks between the

initial data and training data when it has been preprocessed.


15/32

Method Conguration les Output les

3.2 Output les

A le with the output for training data.

A le with the output for test data.

(Optional) Additional output les.

Additional output les can be specied in this list. Although the will be not managed by other KEEL methods, it is possible to employ them toextract useful information from the execution of the method (representationof models extracted, additional outputs, performance measures . . . ). Theymust be marked with the extension .txt.

3.3 Parameters

The rest of the conguration le is used to describe the values of the param-eters. In each line appears one parameter, followed by a = sign and itsassigned value.

If the method needs a seed to initialize a random number generator, itmust be the rst parameter described, employing seed as name of theparameter.

3.4 Example of use

This is a valid example of a Method Conguration le (data les lists arenot fully shown):

algorithm = g e n e t i c algorithminputData = . / d a t a s e t / i r i s / i r i s 1 1 . d at . . .outputData = . / r e s u l t /ga/ i r i s / r e s u l t 1 . t r a . . .

seed = 1234578nGenera t ions = 500c r o s s = t w o p o i n t sc r o s s P r o b a b i l i t y = 0 . 6m u t a t i o n P r o b a b i l i t y = 0 . 2



16/32

Data les

4 Data les

In KEEL, the data sets are managed by plain ASCII text les, with the .datextension. Usually, they are located under the ../dist/data directory, eachone in its own folder (which also should contains the partitions created fromthe whole data set). In addition, preprocess methods will also create datales as its output, which will be placed on the ../datasets directory of itsexperiment.

This section describes the format employed to dene them (which isfairly similar to WEKA arff format). Each KEEL data le is composed by 2sections:

Header : Basic metadata describing the data set.

Data : Content of the dataset.

In both sections it is possible to insert comments, by employing the%character.

4.1 Header

The header is composed by the following metadata:

@rela t ion bupa2@at t r i bu t e mcv nominal {a , b , c}@at t r i bu t e a lk ph os i n t e g e r [ 2 3 , 1 3 8 ]@at t r i bu t e s g pt i n t e ge r [ 4 , 1 5 5]@at t r i bu t e s g ot i n te g er [ 5 , 8 2]@at t r i bu t e gammagt in te ger [5 , 297]@at t r i bu t e d rin ks r e a l [ 0 . 0 , 2 0 . 0 ]@at t r i bu t e s e l e c t o r {t rue , f a l s e }@inputs mcv, alkphos , sgpt , sgot , gammagt , dr inks@outputs s e l e c t o r

@relation : The name of the data set.

@attribute : Describes one attribute of the data (a column). It is possible todene three different types of attributes:



17/32

18/32

Data les Example of use

If the dataset corresponds to a regression problem, the output type must be real:

. . .@at t r i bu t e s e l e ct o r r e al [ 0 . 0 , 2 0 . 0]. . .@outputs s e l e c t o r@dataa , 9 2 , 4 5 , 2 7 , 3 1 , 0 . 0 , 0 . 9a , 6 4 , 5 9 , 3 2 , 2 3 , < null > , 1 7 . 5 b , 54 , < null > , 1 6 , 5 4 , 0 . 0 , 3 . 5

4.3 Example of use

This is a valid example of a data le:

@rela t ion bupa2@at t r i bu t e mcv nominal {a , b , c}@at t r i bu t e a lk ph os i n t e g e r [ 2 3 , 1 3 8 ]@at t r i bu t e s g pt i n t e ge r [ 4 , 1 5 5]@at t r i bu t e s g ot i n te g er [ 5 , 8 2]@at t r i bu t e gammagt in te ger [5 , 297]@at t r i bu t e d rin ks r e a l [ 0 . 0 , 2 0 . 0 ]@at t r i bu t e s e l e c t o r {t rue , f a l s e }@inputs mcv, alkphos , sgpt , sgot , gammagt , dr inks@outputs s e l e c t o r@dataa , 9 2 , 4 5 , 2 7 , 3 1 , 0 . 0 , t r uea , 6 4 , 5 9 , 3 2 , 2 3 , < null > , f a l s e b , 54 , < null > , 1 6 , 5 4 , 0 . 0 , f a l s ea , 7 8 , 3 4 , 2 4 , 3 6 , 0 . 0 , f a l s ea , 5 5 , 1 3 , 1 7 , 1 7 , 0 . 0 , f a l s e b , 62 , 20 , 17 , 9 , 0 . 5 , t r uec , 6 7 , 2 1 , 11 , 11 , 0 . 5 , t r uea , 5 4 , 2 2 , 2 0 , 7 , 0 . 5 , t r ue b , 60 , 25 , 19 , 5 , 0 . 5 , t r ue b , 52 , 13 , 24 , 15 , 0 . 5 , t rue b , 62 , 17 , 17 , 15 , 0 . 5 , t rue

19/32

Output les

5 Output les

Every method in KEEL must produce at least two output les: A train resultsle (marked with the extension .tra) and a test results le (marked with theextension .tst). Although the method can employ additional output les toshow more information about the process performed, those additional lesmust be handled entirely by the method. Thus, KEEL will only handle thetwo standards output les.

Both output les share the same structure: They are composite by thesame header of the data employed as input of the method, and a set of rows (one for each instance of the data set) describing the expected outputsand the outputs obtained by the application of the method. Thus, they arestructured as follows:

< Expected 1,1 > . . . < Expected 1,n >< Method 1,1 > . . . < Method 1,n >< Expected 2,1 > . . . < Expected 2,n >< Method 2,1 > . . . < Method 2,n >

5.1 Example of use

As related before, the structure of the output method is derived from theinput data les employed. For example, if the following le is employed asinput data of a method:

@rela t ion banana@at t r i bu t e a t 1 r ea l [ 3 .0 9 , 2 . 8 1 ]@at t r i bu t e a t 2 r ea l [ 2 .3 9 , 3 . 1 9 ]@at t r i bu t e c l a s s { 1 .0 , 1 . 0}@inputs a t 1 , a t 2@outputs c l a s s@data1 .14 , 0.114, 1.0 1 .0 5 , 0 . 7 2 , 1.0 0 .9 16 , 0 . 3 9 7 , 1 . 0 1 .0 9 , 0 . 4 3 7 , 1 . 0 0 .5 84 , 0 . 0 9 3 7 , 1 . 01 . 8 3 , 0 . 4 5 2 , 1.0 1.25 , 0 .2 86 , 1 . 0. . .


20/32

21/32

Use Case les

6 Use Case les

The use case les of KEEL provides valuable information to understandingevery of the methods which are available to use. They are XML les, locatedin the ../dist/help directory.

Each KEEL use case le is composed by 4 sections:

< method >NameReferenceGeneral Descr ip t ionExample

< / method >

Name : The name of the method.

Reference : A list of references associated with the method.

General Description : Generic information about the method.

Example : A example about the use of the method.

6.1 Name

The rst part of the use case contains the name of the method, enclosed by< name > tags:

< name > Name of the method < / name >

22/32

Use Case les Reference

6.2 Reference

The second part of the use case contains a list of associated references. Theyare enclosed each one by < ref > tags.

< r e f e r ence >< r e f > F i r s t r e fe r en c e < / r e f>< r e f > Second re fe rence < / r e f >

< / r e f e r ence >

6.3 General description

The general description describes some common features about the method,as its objective, parameters, type of data which can be handle, etc. It iscomposed by the following elds:

Type: General type of method.

Objetive: Objective of the method.

How work: A brief explanation about how the method works.

Parameter spec: A specication of each parameter of the method. They are

enclosed each one by < param > tags.Properties: Generic properties of the methods. Each eld can contain yes

or no strings, dening the following capabilities of the method:

Continuous: The method is able to work with continuous values.Discretized: The method is able to work with discretized values.Integer: The method is able to work with integer values.Nominal: The method is able to work with nominal values.Value Less: The method is able to handle missing values.Imprecise Value: The method is able to work with imprecise values.

23/32

Use Case les Example

< genera lDescr ip t ion >< type > General type of method . < / type >< o b j ec t i ve > Objec t ive o f t he method . < / ob j ec t i ve >< howWork > Expla natio n of how works . < /howWork >< parameterSpec >

< param > Parameter one < / param >< param > Parameter two < / param >

< / parame te rSpec >

< continuous > Yes< / continuous >< d i s c r e t i z e d > Yes< / d i s c r e t i z e d >

Yes

< nominal > Yes< / nominal >< valueLess > Yes< / v alueLess >< impreciseValue > Yes< / impreciseValue >

< / g enera lDescr ip t ion >

6.4 Example

The last part of the use case is employed to show an example of utilization

of the method. It can be employ any number of lines (though, it is notrecommended to place huge examples here).

< example >A example o f u t i l i z a t i o n o f t h e method .< / example >

6.5 Example of use

A valid example of a use case le is shown in the next page:

24/32

Use Case les Example of use

< method >< name > Prototype Nearest Neighbor < / name >< r e f e r ence >< r e f > Chin Liang , Chang. F inding Pro to types forNea res t Neighbor C l a s s i f i e r s . IEEE Trans .

on Computers , vol . c 23, No. 11 , 1179 1184.< / r e f >< / r e f e r ence >< genera lDescr ip t ion >

< type > Prepr oces s Method . < / type >< ob j ec t i ve > Reduce t he s i z e o f t he t r a i n i ngs e t w it ho ut l o s i n g p r e c i s i o n o r a c c ur ac y

in order t o a p os te r i or c l a s s i f i c a t i o n

< howWork > This a lgo r it hm merge nea re s t p ro to typeso f t he s et , i f c l a s s i f i c a t i o n accura cy o f theo r i g i n a l t r a i n i ng d at a s e t does n ot d ec re as e a ndi f t he y have g ot t he same c l a s s .I f not , t hey a r e removed from the s e t . < /howWork >

< parameterSpec >< param > Percen tage o f p ro to types : Rea l < / param >

< / parame te rSpec >

< continuous > Yes< / continuous >< d i s c r e t i z e d > Yes< / d i s c r e t i z e d > Yes< nominal > Yes< / nominal >< valueLess > No< / v alueLess >< impreciseValue > No< / impreciseValue >

< / g enera lDescr ip t ion >< example >Problem t yp e : C l a s s i f i c a t i o nMethod : PGPNND at as et : i r i sPa rame ter s : de fau l t va lues

We can see outpu t s e t i n Exper imen t \ Resu l t s \ PGPNN:. . .

< / example >< / method >


25/32

API Dataset

7 API Dataset

One of the main components of KEEL is the API Dataset. It manages theentire process of acquisition, processing and validation of the data les,offering the data sets to the developer in a suitable way, freeing him fromthe task of acquiring the data needed to perform any experiment.

This section describes three key concepts of the API Dataset:

Data les grammar . The grammar employed to dene the data les.Any le generated by this grammar will be a valid data le, accordingto the rules shown in section 4.

Semantic restrictions of the data les . Apart from the syntax restric-tions, some semantic verications are performed by the API Datasetover the data les.

Description of the classes . To close this section, the main publicclasses of the API Dataset are described.

7.1 Data les grammar

In this subsection is shown the grammar which describes the format of theKEEL data les. The nal tokens of the grammar are:

{} . Denotes the void production. It is also known as or .

IDENT . Denotes an identier ( IDENT = (A -Z , a-z , 0-9 )).

INTEGER . Is an integer value (INTEGER = (0 . 9) + ).

REAL . Is a real value (REAL = (0-9)+ [.(0-9)]).

principal -> Relation-> Attributes

-> Inputs-> Outputs-> Data

Relation -> "@relation" IDENT



26/32

API Dataset Data les grammar

Attributes -> "@attribute" IDENT attributeType Attributes-> {}

attributeType -> "integer" intBoundaries-> "real" realBoundaries-> "{" IDENT idList "}"

intBoundaries -> "[" INTEGER "," INTEGER "]"-> {}

realBoundaries -> "[" REAL "," REAL "]"

-> {}

idList -> "," IDENT idList-> {}

Inputs -> "@inputs" IDENT idList-> {}

Outputs -> "@outputs" IDENT idList-> {}

Data -> @data dataList

dataList -> lineData dataList-> {}

lineData -> IDENT lineDataCont

lineDataCont -> "," IDENT lineDataCont2

lineDataCont2 -> "," IDENT lineDataCont2-> {}



27/32

API Dataset Semantic restrictions of the data les

7.2 Semantic restrictions of the data les

7.2.1 Attributes

An attribute can be dened as integer, real or nominal, as the grammar of the data les denes. It is optional to dene de minimum and maximumvalues, or the list of values for any attribute (if they are not dened, theycorrect values will be extracted during the processing of the training le).Anyway, if they are dened for integer or real attributes, the minimum valuedened must be lower than the maximum.

This way, the limits of the values for any attribute will be establishedduring the processing of the training le. However, it is possible to nd

values in the test le which exceed the limits for a concrete attribute (i.e., insome schemes of cross-validation). Depending of the type of the attribute,the actions performed by the API dataset are the following:

Integer or Real attributes: The new value is changed by its nearestcorrect value (i.e, if the value is greater than the maximum, it is re-placed by the maximum; if it is lower than the minimum, it is replaced by this one)

Nominal attributes: The new value is accepted, and the domain of the attribute is enlarged, adding the new value. In addition, the agnewValueInTest is marked on.

Finally, if one of these cases appears, the API Dataset throws a Test-DataBoundsExcedeedException to inform about the changes performed. How-ever, the les will be parsed correctly.

7.2.2 Inputs and outputs denition

The denition of inputs and outputs in the data les is optional. The APIDataset will automatically extract the missing denitions, following theserules:

If no outputs are dened:

If no input are dened, the last attribute is taken as output. Theremaining ones will be taken as inputs.


28/32

API Dataset Semantic restrictions of the data les

If there are some inputs dened, the attributes not marked asinputs will be taken as outputs.

If no inputs are dened, the attributes not marked as outputs will betaken as inputs.

If inputs and outputs are dened, those attributes who are not cur-rently dened in one of these categories, are discarded.

Also, it is important to note that the inputs and outputs attributes will bedened in the same order as they appear in the header of the data le.

7.2.3 Missing values

The API Dataset allows the presence of missing values in the data les,dened with the < null > or ? tokens. However, only input attributes canpresent missing values. If a missing value is detected in an output attribute,a OutputValueNotKnownException will be cast, aborting the processing of thedata le.

7.2.4 Train and test les

The semantic verications performed by the API Dataset will vary depend-

ing on the concrete data le processed. Concretely, the actions performedare:

The denition of the attributes is taken from the training le.

During the test le reading, the denitions of the attributes are checked.If they are not consistent with the ones read from the training le, theprocessing of the test le is aborted. Moreover, the inputs and outputsdened by the test le must be the same which were dened by thetraining le. Otherwise, the processing of the test le will be aborted.


29/32

API Dataset Description of the classes

7.3 Description of the classes

The API Dataset is composed by four main classes:

InstanceSet: This class contains a complete set of instances dening adata base.

Instance: This class represents a single instance.

Attributes: This static class contains denitions about every attributeof the data contained in the Instance set.

Attribute: This class contains relevant information about a single

attribute.

The next subsections will describe their main characteristics.

7.3.1 InstanceSet

This class contains a complete set of instances. Its public methods are:

numInstances . Returns the number of instances of the Instance Set.

getInstance . Returns a concrete instance contained in the Instance Set.

getInstances . Returns an array with all the instances of the InstanceSet.

7.3.2 Instance

The objects of this class represents instances of the data sets. Its pubicmethods are:

getInputRealValues . Returns an array containing all the input valuesof the instance (only the positions with INTEGER or REAL attributes

values will produce a value). getInputNominalValues . Returns an array containing all the input

values of the instance (only the positions with NOMINAL attributesvalues will produce a value).



30/32


getInputMissingValues . Returns a boolean array dening which in-put values are missing.

getInputRealValue . Returns the value of a concrete input attribute(only the positions with INTEGER or REAL attributes values willproduce a value).

getInputNominalValue . Returns the value of a concrete input at-tribute (only the positions with NOMINAL attributes values will pro-duce a value).

getInputMissingValue . Returns a boolean value dening if the inputvalue is missing.

getOutputRealValues . Returns an array containing all the outputvalues of the instance (only the positions with INTEGER or REALattributes values will produce a value).

getOutputNominalValues . Returns an array containing all the outputvalues of the instance (only the positions with NOMINAL attributesvalues will produce a value).

getOutputMissingValues . Returns a boolean array dening whichoutput values are missing.

getOutputRealValue . Returns the value of a concrete output attribute(only the positions with INTEGER or REAL attributes values willproduce a value).

getOutputNominalValue . Returns the value of a concrete outputattribute (only the positions with NOMINAL attributes values willproduce a value).

getInputMissingValue . Returns a boolean value dening if the out-put value is missing.

getAllInputValues . Returns an array containing all the input values.REAL values are returned as double values. INTEGER values arecasted to double. NOMINAL values are transformed to INTEGER and

casted to double. getAllOutputValues . Returns an array containing all the output val-

ues. REAL values are returned as double values. INTEGER valuesare casted to double. NOMINAL values are transformed to INTEGERand casted to double.



31/32


7.3.3 Attributes

Attributes is an static class which stores the denitions of the attributesrepresented in the data set. It contains an array of Attribute objects, and twoadditional arrays storing references about the input and output attributes.The order of the attributes stored is the same order than it was found in theinput data le.

Its public methods are:

getInputAttributes . Returns an array containing all the input At-tributes.

getOutputAttributes . Returns an array containing all the output At-tributes.

getInputAttribute . Returns a single input attribute.

getOutputAttribute . Returns a single output attribute.

getAttribute . Returns a single attribute, dened neither as input noras output attribute.

getNumInputAttributes . Returns the number of input attributes.

getNumOutputAttributes . Returns the number of output attributes.

getNumAttributes . Returns the number attributes, including input,output and undened ones.

7.3.4 Attribute

The Attribute class contains the denition an attribute of the dataset. Itspublic methods are:

getType . Returns a integer value dening the type of the attribute (thetype is dened as NOMINAL, INTEGER or REAL).

getName . Returns the name of the attribute.

getMinAttribute . Returns the minimum value of the attribute (onlyavailable in INTEGER or REAL attributes).



32/32


getMaxAttribute . Returns the minimum value of the attribute (onlyavailable in INTEGER or REAL attributes).

getNominalValuesList . Returns an array with all the values denedfor the attribute (only available in NOMINAL attributes).

convertNominalValue . Converts a nominal value to its representa-tions as integer (an integer between [0. . . N-1, where N is the numberof values dened for the attribute).

getDirectionAttribute . Returns an integer showing if the attributeis dened as input attribute (INPUT), output attribute (output), orundened (DIR NOT DEF)

getNewValuesInTest . Returns n array with the new values of theattribute observed in test data.


keelreferencemanualv1.0

Documents