
SEMESTRAL WORK FOR THE COURSE 336VD (DATA MINING), CZECH TECHNICAL UNIVERSITY IN PRAGUE, 2008/2009

Classification of newborn’s sleeping phases from their EEG

Dominik Franěk

Abstract— Correct classification of a newborn’s sleeping phases from their EEG can help to predict brain problems or other mental defects. This semestral work was undertaken to find the optimal k for a nearest neighbor classifier. The choice of kNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. The best k was found to be 3, with an accuracy of 83.69%. This means that whenever a newborn’s EEG is given, the algorithm can classify the newborn’s sleeping phases by looking at the 3 nearest EEG records.

I. ASSIGNMENT

Use the method of k Nearest Neighbors to classify the target attribute of a chosen dataset. Choose one of the classes as the target (positive) class. Find the best classifier whose False Positive rate (FPr) is below 0.3. Compute the accuracy and True Positive rate (TPr) of this classifier.

II. INTRODUCTION

The problem is to find the optimal k for a Nearest Neighbor classifier (hereafter abbreviated NN) on the given dataset.

The algorithm can be briefly summarized as follows: in the training phase, it computes similarity measures from all rows in the training set and combines them into a global similarity measure using the XValidation method. In the testing phase, for rows with “unknown” classes, it chooses their k nearest neighbors in the training set according to the trained similarity measure and then uses a customized voting scheme to generate a list of predictions with confidence scores [4].

The dataset is in *.arff format and each row has 55 attributes. The attribute called “class” has 4 nominal values (0, 1, 2, 3) and represents the classified newborn’s sleeping phases. I did not find anywhere what exactly these phases mean, but from my observations I expect that the kind of sleep can be computed from the given attributes (EEG c1 alpha, ...) [5].
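As a quick illustration of what such a file contains, the dataset can be loaded and inspected with a few lines of Python. This is only a sketch: the file name newborn_eeg.arff is a placeholder, and SciPy is just one of several libraries that read ARFF files.

```python
from collections import Counter

import pandas as pd
from scipy.io import arff

# Load the ARFF file; the file name here is hypothetical.
data, meta = arff.loadarff("newborn_eeg.arff")
df = pd.DataFrame(data)

# Expected from the text above: 42027 rows, 55 attributes,
# and a nominal "class" attribute with values 0, 1, 2, 3.
print(len(df), "rows,", len(df.columns), "attributes")
print(Counter(df["class"]))  # note: scipy returns nominal values as bytes, e.g. b'0'
```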

The given dataset is already preprocessed a little: there are no rows with zero attributes and all attributes are numerical values. Fig. 1 shows all attributes of the dataset and their values. These values are not normalized, so the range of attribute values is from −5 to 543. The normalized dataset is shown in Fig. 2, where all values lie in the range from 0.0 to 1.0 and each class (0, 1, 2, 3) has a different color. The dataset is too big to be processed at once, since it consists of 42027 rows, each with 55 regular attributes.

III. EXPERIMENTS

The chosen positive class of the original data is class 0 (renamed to class 1 in the normalized subset). The other classes (1, 2, 3) are set as negative.

Fig. 1. Graph showing the original values of the attributes; x-axis: attributes, y-axis: attribute values (−5 to 543)

Fig. 2. Graph showing the normalized values of the attributes; x-axis: attributes, y-axis: attribute values (0.0 to 1.0)

The dataset clearly has to be preprocessed before the experiments can start. First, the Normalization operator is applied (shown in Fig. 5, page 3), which normalizes all numerical values to the range from 0.0 to 1.0. Extreme values are not treated specially, because the next preprocessing step selects just 1/70 of all rows (2942 rows), which effectively “eliminates” them. This subset is chosen with the Stratified Sampling method, with the attribute named “class” set as the label. Of the 2942 chosen rows, 2210 are labeled as class 0 and 732 as class 1 (Tab. I). Class 0 is merged from the original classes 1, 2 and 3; class 1 is renamed from the original class 0. The normalized data subset is shown in Fig. 3.
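The preprocessing chain just described can be sketched in Python as follows. The sampling fraction, the class merging and the random seed 2001 are taken from this report; pandas and the exact function names are my assumption, since the actual work was done with RapidMiner operators.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, frac: float = 1 / 70, seed: int = 2001) -> pd.DataFrame:
    """Min-max normalization, class merging and stratified sampling."""
    features = df.drop(columns=["class"])

    # Normalization operator: rescale every numerical attribute to [0.0, 1.0].
    out = (features - features.min()) / (features.max() - features.min())

    # Original class 0 becomes the positive class 1; classes 1, 2, 3 merge into 0.
    out["class"] = (df["class"] == 0).astype(int)

    # Stratified sampling: draw the same fraction from each class.
    return out.groupby("class", group_keys=False).sample(frac=frac, random_state=seed)
```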

After attribute normalization, the model-training phase begins. As shown in Fig. 5 (right side), the normalized subset is divided into 2 parts: 1/5 of the subset goes to the training phase and 4/5 is used for testing. In the training phase, the Parameter Iteration operator is used to iterate the k of NN; k is iterated from 1 to 15, increasing by +1.

To avoid overfitting of the NN, a method called K-fold cross-validation (CV) is used.


Attr. name          Statistics        Range
class               label             0.0 (2210), 1.0 (732)
PNG                 0.427 +/- 0.111   [0.000 ; 0.919]
PNG filtered        0.359 +/- 0.166   [0.000 ; 1.000]
EMG std             0.114 +/- 0.090   [0.033 ; 0.766]
EMG std filtered    0.126 +/- 0.138   [0.004 ; 0.874]
ECG beat            0.427 +/- 0.135   [0.212 ; 0.993]
ECG beat filtered   0.444 +/- 0.138   [0.225 ; 0.987]
EEG fp1 delta       0.216 +/- 0.065   [0.081 ; 0.964]
EEG fp2 delta       0.218 +/- 0.067   [0.071 ; 0.958]
EEG t3 delta        0.202 +/- 0.074   [0.064 ; 0.906]
EEG t4 delta        0.232 +/- 0.089   [0.062 ; 0.956]
EEG c3 delta        0.243 +/- 0.072   [0.091 ; 0.961]
EEG c4 delta        0.244 +/- 0.070   [0.089 ; 0.968]
EEG o1 delta        0.212 +/- 0.077   [0.066 ; 0.958]
EEG o2 delta        0.211 +/- 0.083   [0.046 ; 0.933]
EEG fp1 theta       0.188 +/- 0.072   [0.068 ; 0.976]
EEG fp2 theta       0.216 +/- 0.075   [0.090 ; 0.972]
EEG t3 theta        0.222 +/- 0.065   [0.077 ; 0.970]
EEG t4 theta        0.264 +/- 0.079   [0.082 ; 0.938]
EEG c3 theta        0.308 +/- 0.061   [0.101 ; 0.962]
EEG c4 theta        0.299 +/- 0.060   [0.098 ; 0.960]
EEG o1 theta        0.219 +/- 0.067   [0.080 ; 0.922]
EEG o2 theta        0.271 +/- 0.079   [0.080 ; 0.931]
EEG fp1 alpha       0.112 +/- 0.077   [0.043 ; 0.981]
EEG fp2 alpha       0.124 +/- 0.081   [0.046 ; 0.956]
EEG t3 alpha        0.158 +/- 0.080   [0.055 ; 0.946]
EEG t4 alpha        0.181 +/- 0.082   [0.055 ; 0.928]
EEG c3 alpha        0.249 +/- 0.070   [0.088 ; 0.943]
EEG c4 alpha        0.246 +/- 0.069   [0.085 ; 0.957]
EEG o1 alpha        0.116 +/- 0.066   [0.039 ; 0.910]
EEG o2 alpha        0.151 +/- 0.066   [0.048 ; 0.935]
EEG fp1 beta1       0.114 +/- 0.079   [0.043 ; 0.985]
EEG fp2 beta1       0.123 +/- 0.083   [0.046 ; 0.943]
EEG t3 beta1        0.152 +/- 0.084   [0.045 ; 0.957]
EEG t4 beta1        0.168 +/- 0.087   [0.053 ; 0.930]
EEG c3 beta1        0.234 +/- 0.077   [0.092 ; 0.942]
EEG c4 beta1        0.226 +/- 0.074   [0.079 ; 0.949]
EEG o1 beta1        0.091 +/- 0.070   [0.028 ; 0.916]
EEG o2 beta1        0.129 +/- 0.070   [0.041 ; 0.970]
EEG fp1 beta2       0.217 +/- 0.081   [0.086 ; 0.990]
EEG fp2 beta2       0.211 +/- 0.076   [0.083 ; 0.958]
EEG t3 beta2        0.189 +/- 0.070   [0.063 ; 0.927]
EEG t4 beta2        0.226 +/- 0.083   [0.065 ; 0.922]
EEG c3 beta2        0.248 +/- 0.066   [0.092 ; 0.960]
EEG c4 beta2        0.246 +/- 0.065   [0.090 ; 0.966]
EEG o1 beta2        0.230 +/- 0.085   [0.076 ; 0.958]
EEG o2 beta2        0.220 +/- 0.080   [0.055 ; 0.932]
EEG fp1 gama        0.154 +/- 0.073   [0.058 ; 0.976]
EEG fp2 gama        0.172 +/- 0.076   [0.075 ; 0.956]
EEG t3 gama         0.196 +/- 0.069   [0.067 ; 0.958]
EEG t4 gama         0.227 +/- 0.078   [0.071 ; 0.897]
EEG c3 gama         0.289 +/- 0.063   [0.097 ; 0.959]
EEG c4 gama         0.281 +/- 0.061   [0.095 ; 0.959]
EEG o1 gama         0.168 +/- 0.065   [0.062 ; 0.915]
EEG o2 gama         0.237 +/- 0.077   [0.072 ; 0.912]

TABLE I: STATISTICS OF ATTRIBUTES OF THE NORMALIZED SUBSET

Fig. 3. Graph showing the normalized data subset with positive class = 1; x-axis: attributes, y-axis: attribute values (0.0 to 1.0)

For each iteration of k, CV is run 10 times: CV divides the training set into 2 parts, trains kNN on the first part and validates it on the second. After the 10 rounds of K-fold cross-validation, the average accuracy of kNN over these 10 runs is computed. Once k has been iterated from 1 to 15, the k with the highest average accuracy is selected and used in the testing phase. A graph of the average accuracy for each k is shown in Fig. 4.
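A rough scikit-learn equivalent of this model-selection loop (Parameter Iteration wrapped around XValidation) might look as follows; it sketches the procedure, not the actual RapidMiner process. In this report the selection came out as k = 3 (Fig. 4).

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def select_k(X_train, y_train, k_values=range(1, 16)):
    """Return the k with the highest mean 10-fold cross-validation accuracy."""
    mean_acc = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
        mean_acc[k] = scores.mean()
    return max(mean_acc, key=mean_acc.get)
```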

Fig. 4. Average accuracy for kNN; x-axis: k; y-axis: accuracy

IV. METHODOLOGY

A. Used tool

The tool used is RapidMiner (v4.0) [1]. RapidMiner lets the user carry out all phases of data mining in a single tool, so only one environment has to be learned. All operators used in this work are accessible in the basic version of RapidMiner.

B. Configuration

The project is built by combining many operators in RapidMiner. The complete tree view of the operators used to find the best k for Nearest Neighbor classification is shown in Fig. 5.

• All operators have their local random seed set to -1; only the Root operator has the value 2001, so that the random operations generate the same values on every run. If an operator has a sampling type, it is set to stratified sampling.

• The SplitChain operator has its split ratio set to 0.2.


• The XValidation operator has the number of validations set to 10 and the measure set to Euclidean Distance.

• The NearestNeighbor trying k operator has k set to 15, but this parameter is driven by the Iterating k - training operator.

• The ClassificationPerformance (1) operator has accuracy checked.

• The NearestNeighbor defined k operator has k set to 3 and the measure set to Euclidean Distance.

• The ClassificationPerformance (2) operator has accuracy checked.

• The BinominalClassificationPerformance operator has fallout checked.

• The ProcessLog operator logs the accuracy from ClassificationPerformance.

Fig. 5. “Box view” of the complete project in RapidMiner

C. Experiments setup

The Nearest Neighbor classification uses the Euclidean distance to find the k nearest neighbors. In plain words: “find the points $x_i$ closest to $x_j$”. The Euclidean distance between two rows $x_i$ and $x_j$, each with $n$ attributes, is defined as

$d(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$

The NN algorithm can be built as follows [6]:

• Training phase: build the set of training examples $T$.
• Testing phase: given a query instance $x_q$ to be classified, let $x_1, \ldots, x_k$ denote the $k$ instances from $T$ that are nearest to $x_q$, and return

$F(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$,

where $V$ is the set of class values, $f(x_i)$ is the class of training instance $x_i$, and $\delta(a, b) = 1$ if $a = b$ and 0 otherwise. A sketch of this rule in code is given below.
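A direct implementation of this decision rule might look as follows; it is a minimal sketch assuming NumPy arrays, not the NearestNeighbor operator actually used in the experiments.

```python
import numpy as np

def knn_predict(X_train: np.ndarray, y_train: np.ndarray, x_q: np.ndarray, k: int = 3):
    """Classify query x_q by majority vote among its k nearest training examples."""
    # Euclidean distances d(x_i, x_q) from the query to every training row.
    distances = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))

    # Indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]

    # F(x_q) = argmax_v sum_i delta(v, f(x_i)): the most frequent neighbor label.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```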

The best k for the Nearest Neighbor classifier is found by iterating k from 1 to 15; the upper bound of 15 was judged sufficient. In each iteration the ClassificationPerformance operator computes the accuracy for the given k. The ProcessLog operator records the results of ClassificationPerformance and generates the report (Fig. 4). From the report it is evident that the best k is 3.

          True 0   True 1
Pred 0      1609      225
Pred 1       159      361

TABLE II: NN CLASSIFICATION FOR k = 3; accuracy = 83.69%, FPr = 8.99%, TPr = 61.60%

The positive class is the class with value 1. For k = 3 the accuracy is 83.69%, as shown in Tab. II. The False Positive rate (FPr) of this classifier is 8.99%:

FPr = 159 / (159 + 1609) = 0.0899

The True Positive rate (TPr) is 61.60%, because there are 586 examples with class = 1 and just 361 of them were classified correctly:

TPr = 361 / (361 + 225) = 0.616
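The three reported numbers follow directly from the confusion matrix in Tab. II, as this small check shows:

```python
# Confusion matrix from Tab. II (rows = predictions, columns = true classes).
tn, fn = 1609, 225  # predicted 0: true 0, true 1
fp, tp = 159, 361   # predicted 1: true 0, true 1

accuracy = (tn + tp) / (tn + fn + fp + tp)  # 0.8369 -> 83.69%
fpr = fp / (fp + tn)                        # 0.0899 ->  8.99%
tpr = tp / (tp + fn)                        # 0.6160 -> 61.60%
```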

V. DISCUSSION

The False Positive rate looks very good. It may seem suspiciously low, but that is probably due to the big subset of training data.

Open to discussion is whether the ratio of 0.2 used to divide the data subset into training and testing parts is set correctly. With a faster computer the opposite ratio of 0.8 could be used. In my opinion the 584 training examples were enough, and the FPr suggests the ratio was not chosen badly. On the other hand, the TPr is 61.60%, which is not much; it could easily be pushed higher or lower, at the cost of influencing the FPr.

The next question is whether the algorithm should use a weighted kNN. I tried to find dependencies between the 55 attributes but was not successful, so I do not think setting weights on the attributes would be helpful.
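For completeness, the attribute-weighted variant discussed here would replace the plain Euclidean distance with a per-attribute weighted one. A minimal sketch, with the weight vector w as a purely hypothetical input:

```python
import numpy as np

def weighted_distance(x_i: np.ndarray, x_j: np.ndarray, w: np.ndarray) -> float:
    """Attribute-weighted Euclidean distance: sqrt(sum_a w_a * (x_ia - x_ja)^2).

    With w = np.ones(55) this reduces to the unweighted distance
    used in the experiments above."""
    return float(np.sqrt((w * (x_i - x_j) ** 2).sum()))
```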


VI. CONCLUSION

In my opinion, a very good classifier was found for the subset of the dataset. Some improvements could still be made to make the algorithm work better; they would be worthwhile if such a classifier were used in practice, but for a school work they are not so important.

The hardest part of the work was exploring the operators in RapidMiner and finding the right ones. I know some of them could still be replaced by better operators, but this solution worked and, what is more, it gave good results. Most of my time was spent waiting for RapidMiner to process all the operators on the given dataset. Unfortunately, the program is written in Java, which is not a language for scientific computing, and I had to restart Java quite often because it ran out of memory. The most interesting part for me was generating the graphs and writing this report.

I am very satisfied that I finished the work, and I can say that I learnt a lot about data mining and about classifying a dataset. I am afraid anyone can tell from this work that my future specialization will be Software Engineering and that such scientific work is not my cup of tea.

REFERENCES

[1] CENTRAL QUEENSLAND UNIVERSITY. RapidMiner GUI Manual [online]. May 29, 2007 [cit. 2008-02-08]. Available from WWW: <http://os.cqu.edu.au/oswins/datamining/rapidminer/rapidminer-4.0beta-guimanual.pdf>.

[2] FARKASOVA, Blanka; KRCAL, Martin. Project Bibliographic Citations [online]. c2004-2008 [cit. 2008-05-08]. CZ. Available from WWW: <http://www.citace.com/>.

[3] LAURIKKALA, Jorma. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Department of Computer and Information Sciences, University of Tampere, 2001. 14 p. Report. Available from WWW: <http://www.cs.uta.fi/reports/pdf/A-2001-2.pdf>. ISBN 951-44-5093-0.

[4] TEKNOMO, Kardi. K-Nearest Neighbors Tutorial [online]. c2006 [cit. 2008-05-08]. Available from WWW: <http://people.revoledu.com/kardi/tutorial/KNN/>.

[5] POBLANO, Adrian; GUTIERREZ, Roberto. Correlation between the neonatal EEG and the neurological examination in the first year of life in infants with bacterial meningitis. Arq. Neuro-Psiquiatr. [online]. 2007, vol. 65, no. 3a [cit. 2008-05-10], pp. 576-580. Available from WWW: <http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0004-282X2007000400005&lng=en&nrm=iso>. ISSN 0004-282X. doi: 10.1590/S0004-282X2007000400005.

[6] SOLOMATINE, D. P. Instance-based Learning and k-Nearest Neighbor Algorithm [online]. c1988-2003 [cit. 2008-05-10]. EN. Available from WWW: <http://www.xs4all.nl/~dpsol/data-machine/nmtutorial/instancebasedlearningandknearestneighboralgorithm.htm>.

[7] VAYATIS, Nicolas; CLEMENCON, Stephan. Advanced Machine Learning Course [online]. [2008] [cit. 2008-05-08]. EN. Available from WWW: <http://www.cmla.ens-cachan.fr/Membres/vayatis/teaching/cours-de-machine-learning-ecp.html>.

[8] ZHU, Xiaojin. K-nearest-neighbor: An Introduction to Machine Learning. CS 540: Introduction to Artificial Intelligence [online]. 2005 [cit. 2008-05-08]. Available from WWW: <http://pages.cs.wisc.edu/~jerryzhu/cs540/knn.pdf>.

[9] van den BOSCH, Antal. Video: K-nearest Neighbor Classification [online]. Tilburg University, c2007 [cit. 2008-05-10]. EN. Available from WWW: <http://videolectures.net/aaai07_bosch_knnc/>.