COMPARATIVE STUDY OF DATA MINING METHODOLOGIES FOR PREDICTION OF PARKINSONS DISEASE BY STATISTICAL METHODS

M. S. Roobini1, M. Lakshmi2

1Research Scholar, Computer Science, Sathyabama Institute of Science and Technology

2Dean, School of Computing, Sathyabama Institute of Science and Technology

July 22, 2018

Abstract

Nowadays, data mining plays a vital role in the biomedical field, where it is mainly used for the prediction and diagnosis of diseases. Parkinson's disease is a neurodegenerative disorder that has become one of the major challenges for doctors and researchers in current society. Prediction of Parkinson's disease is essential for a healthy environment. This study provides knowledge about some data mining techniques for understanding the diagnosis, and for the classification and prediction, of Parkinson's disease. Because the source of the disease is not exactly predictable, it is necessary to predict Parkinson's disease before it becomes severe. Several data mining algorithms are applied to the selected dataset for classification and prediction. Before that, pre-processing was done to remove missing values. The pre-processed data then underwent classification and prediction of Parkinson's disease.


International Journal of Pure and Applied Mathematics
Volume 120 No. 6 2018, 9475-9487
ISSN: 1314-3395 (on-line version)
url: http://www.acadpubl.eu/hub/
Special Issue


1 Introduction

Nowadays, data mining plays a vital role in the health sector and in many other applications. Recent trends in data mining methodologies are very useful to researchers as well as to society. A severe threat to people nowadays is the increase in the number of Parkinson's disease cases, which grows day by day. Parkinson's disease is a neurodegenerative disorder that affects the brain. It is an idiopathic disease and hence a great threat to society, so early prediction of the disease can reduce the death rate. The disease is caused by a problem in the central nervous system. The data mining methodologies applied here help in distinguishing healthy people from people affected by Parkinson's disease. The deficiency or death of dopamine cells in the midbrain is one of the causes of Parkinson's disease. The architecture applied here consists of data understanding, data preprocessing, modeling, and data analysis. Some of the symptoms of Parkinson's disease are abnormal speech and behavior, difficulty in motor and non-motor activities, misplacing things, and poor decision making. The literature survey shows that various data mining methods have been used for the diagnosis and prediction of Parkinson's disease, but the efficiency of the result is very important for predicting the disease. Various kinds of patient information are updated in medical databases day by day, so the major issue is the quality of the dataset. The dataset may contain noisy data, missing values, incomplete data, etc. Such issues can be overcome by using various preprocessing steps. Various machine learning methodologies have been used for the interpretation of datasets. A dataset from the UCI repository has been taken for the analysis. Once preprocessing is complete, the preprocessed data are further analyzed for classification and prediction. Prediction algorithms such as support vector machine, logistic regression, and random forest are used for the analysis, and the algorithm that gives the best accuracy is identified.


2 Literature Review

In this section, the details of Logistic Regression, Support Vector Machines, and Random Forest are discussed.

2.1 Regression

In statistical modeling, regression analysis is the process of estimating the relationships between variables. It comprises multiple techniques for modeling and analyzing several variables. It helps in understanding the relationship between the dependent variable and one or more independent variables.

2.1.1 Logistic Regression

Logistic regression is a method for analyzing a dataset in which one or more independent variables determine an outcome. It gives a basic overview of which features affect the binary output, along with probability responses and p-values.


Figure 1. A logistic regression line

The outcome is binary in nature (i.e., TRUE or FALSE). The model is used to describe the relationship between the binary characteristic of interest and a set of independent variables.
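The mapping from independent variables to a binary outcome can be sketched as follows. This is only a minimal illustration in Python, not the implementation used in this study: the coefficient values, the 0.5 threshold, and the function names are assumptions chosen for the example.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample under a linear model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary outcome (1 = TRUE, 0 = FALSE)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients for two features (not fitted to any real data).
weights = [0.8, -1.2]
bias = 0.1
sample = [2.0, 0.5]

p = predict_proba(sample, weights, bias)
label = predict_label(sample, weights, bias)
```

In a fitted model the weights and bias would be estimated from the training data; here they are fixed by hand so that the probability step is easy to follow.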

2.2 Random Forest Method

Random Forest is an algorithm based on statistical learning theory. It uses bootstrap randomized re-sampling to extract multiple versions of the sample set from the original training dataset, builds a decision tree model for each sample set, and finally combines the results of all the decision trees to predict the class by an established voting mechanism. Since the severity of the disease leads to a high death rate in society, the disease should be predicted early with effective machine learning algorithms. Random Forest is an extension of the decision tree algorithm: it creates multiple decision trees when the model is built and outputs the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees. Previous surveys report that Random Forest gives better efficiency in predicting diseases than other methods, since it handles and analyzes many combinations of variables in a dataset. An example diagram for Random Forest is given below.


Figure 2. Random Forest
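The bootstrap re-sampling and voting steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the model used in this study: each "tree" is reduced to a single split (a decision stump) on one randomly chosen feature, and the toy rows, function names, and parameter values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the training set."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, rng):
    """Fit a one-split tree on a randomly chosen feature.

    Each row is a (features, label) pair. A stump stands in for a full
    decision tree to keep the sketch short.
    """
    feature = rng.randrange(len(rows[0][0]))
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote (mode) over the predictions of the individual trees."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy training rows (hypothetical values): label 0 = healthy, 1 = affected.
rows = [
    ([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.3, 0.3], 0),
    ([0.8, 0.9], 1), ([0.9, 0.7], 1), ([0.7, 0.8], 1),
]
forest = fit_forest(rows)
```

For regression, the vote in `predict_forest` would be replaced by the mean of the individual tree predictions.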

2.3 Support Vector Machine (SVM)

A support vector machine performs classification, regression, and related tasks by constructing a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space. SVM is useful for many applications, such as text categorization and image classification.

Figure 3. Support Vector Machine
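As a rough illustration of how a separating hyperplane is used once it has been found, the sketch below classifies a point by the sign of w.x + b for the linear case. The weight vector and bias are hypothetical values rather than the result of training, and the function names are assumptions made for this example.

```python
def decision_value(x, weights, bias):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(x, weights, bias):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(x, weights, bias) >= 0 else -1


def margin_width(weights):
    """Width of the margin, 2 / ||w||, for a hyperplane in canonical form."""
    norm = sum(w * w for w in weights) ** 0.5
    return 2.0 / norm


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 (not learned from data).
weights, bias = [3.0, 4.0], -5.0

side_a = classify([1.0, 1.0], weights, bias)
side_b = classify([0.0, 0.0], weights, bias)
width = margin_width(weights)
```

Training an SVM amounts to choosing the weights and bias so that this margin is as wide as possible while the training points fall on the correct side.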


3 Dataset Description

This dataset consists of a range of biomedical measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a measure, and each row corresponds to one of 195 voice recordings from these individuals. The aim of the dataset is to differentiate healthy people from people with Parkinson's disease. There are around six recordings per patient; the name of the patient is identified in the first column.

Figure 4. Raw Dataset

Figure 5. Correlation Matrix

Attribute Information

Matrix column entries (attributes):
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of ratio of noise to tonal components in the voice
status - Health status of the subject: (one) Parkinson's, (zero) healthy
RPDE, D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation

Figure 6. Scatter Plot

3.1 Data Preprocessing

Before applying the machine learning algorithms for classification and prediction, the selected dataset was preprocessed to remove irrelevant data and to find missing values. The preprocessed data were then analyzed further for prediction of the disease.
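One simple way to carry out such a step is to drop every record that has a missing field. The sketch below assumes records are held as dictionaries and that missing entries appear as None, NaN, an empty string, or "?". These conventions, the function names, and the sample values are assumptions for illustration, not details taken from this study.

```python
import math


def is_missing(value):
    """Treat None, NaN, empty strings, and '?' as missing entries."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() in {"", "?"}
    return isinstance(value, float) and math.isnan(value)


def drop_incomplete_rows(rows):
    """Keep only the records in which every field is present."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Made-up records that only mimic the column layout of the dataset.
records = [
    {"name": "subject_01_rec_1", "MDVP:Fo(Hz)": 119.9, "status": 1},
    {"name": "subject_01_rec_2", "MDVP:Fo(Hz)": None, "status": 1},
    {"name": "subject_02_rec_1", "MDVP:Fo(Hz)": float("nan"), "status": 0},
    {"name": "subject_03_rec_1", "MDVP:Fo(Hz)": 197.1, "status": 0},
]

clean = drop_incomplete_rows(records)
```

Dropping rows is the most conservative choice; replacing a missing value with the column mean or median is a common alternative when records are scarce.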

3.2 Exploratory Data Analysis

The distribution and standard deviation of each column are computed from the dataset. The asymmetry in the distribution of the data is computed and stored in the skew array. A positive skew value is observed for all columns, and several plots are drawn based on the correlation matrix obtained from the dataset.
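The standard deviation and skewness of a column can be computed as sketched below. The sketch uses the population standard deviation and the Fisher-Pearson moment coefficient of skewness, which is one common definition and may differ from the estimator used in this study. The sample values are made up for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5.

    Positive values indicate a longer right tail, negative values a
    longer left tail, and zero a symmetric distribution.
    """
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical columns: one symmetric, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

skew = [skewness(symmetric), skewness(right_tailed)]
```

A positive entry in `skew`, as for the second column, corresponds to the positive skew reported for the columns of the dataset.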


Figure 7. Logistic Regression Graph Model

3.3 Classification and Prediction Modeling

Classification models predict categorical class labels. Modeling proceeds in the following steps: data cleaning, analysis, data transformation, and data reduction. The selected dataset is preprocessed and divided into two sets, training data and test data. Machine learning algorithms such as Support Vector Machine, Logistic Regression, and Random Forest are applied. Logistic Regression showed an accuracy of 0.9172205. Support Vector Machine (SVM) is a machine learning algorithm used to classify Parkinson's disease from high-dimensional medical datasets. The Random Forest algorithm is good at handling high-dimensional data and at dealing with missing values. Random Forest gave a higher accuracy of 0.9204355, compared to Support Vector Machine, which gave 0.9114924.
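The split into training and test data and the accuracy figure used to compare the models can be sketched as follows. The 70/30 split ratio, the fixed random seed, and the function names are assumptions for illustration, since the split used in this study is not stated.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Ten placeholder rows stand in for the preprocessed records.
rows = list(range(10))
train, test = train_test_split(rows)

# Hypothetical labels and predictions for four test records.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Each model is fitted on `train` only, and the accuracy on `test` is the figure compared across algorithms.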


Figure 8. Logistic Regression graph

4 Conclusion

Parkinson's disease is a neurodegenerative disorder that affects the motor activities of the affected person. It should be identified early so that the number of patients affected by the disease can be reduced. Since there is no medicine that cures the disease completely, precaution is the only option. A prediction model was designed that helps predict Parkinson's disease from the physical measurement data in the dataset. The dataset was preprocessed to obtain better results, and classification and prediction methodologies were applied to the preprocessed dataset. From the results, Random Forest gave the highest accuracy (0.9204355), compared with Logistic Regression (0.9172205) and Support Vector Machine (0.9114924).

5 Related Work

• P. Suganya and C. P. Sumathi made a study that provides knowledge about techniques for the effective classification of Parkinson's disease. It adopted a data mining algorithm for the detection and classification of Parkinson's disease. Here, 195 instances were selected for the investigation. The method has five phases: training dataset, data pre-processing, feature selection, classification, and evaluation. Measures such as specificity, sensitivity, accuracy, and positive and negative predictive values are also evaluated. The study also compares five classification algorithms, and the comparison supports the identification of specificity, accuracy, and sensitivity as performance measures. The paper estimates the efficiency and efficacy of the selected algorithm for detecting Parkinson's disease in the dataset using various classifiers. The study shows that the ABO algorithm has 97 percent accuracy for classification and filtering of features.

• In this paper, the minimum redundancy maximum relevance feature selection algorithm is used to select the most important features for predicting Parkinson's disease. It was observed that random forest with 20 features selected by the minimum redundancy maximum relevance algorithm provides an overall accuracy of 90.3%, which is better than the other machine learning methods compared, such as random forest, rotation forest, random subspace, bagging, boosting, SVM, and decision tree.

• Dr. Hariganesh S and Gracy Annamary S surveyed various data mining methods used for the diagnosis of Parkinson's disease. The survey discusses classification techniques such as Random Forest, MLP Network, and Neural Network. Among all the algorithms considered, Random Forest shows high accuracy for prediction of Parkinson's disease.

• This study describes and evaluates the discriminative ability of data mining algorithms for classification of Parkinson's cases, and uses a cross-validation approach to compare various data mining algorithms and determine which provides the most efficient result. The predictive accuracy of the data mining model is verified. The proposed methodology demonstrates the feasibility of using data mining models for monitoring and diagnosis of emerging neurological diseases.

• Tarigoppula V. S. Sriram, M. Venkateswara Rao et al. used tools such as Orange and Weka for statistical analysis, classification, and unsupervised learning. The study uses a voice dataset for Parkinson's disease. On this dataset, SVM showed a good accuracy of 88.9% compared to the Majority and k-NN algorithms. Random Forest showed a good accuracy of 90.26% and Naive Bayes showed an accuracy of 69.23%.

• Here, 100 patients were diagnosed and reported as having Parkinson's disease on the basis of pathological findings. Voice data from patients already known to have Parkinson's disease were collected by recording their voices. The recordings were analyzed using data mining techniques: Bayes Net showed 70% accuracy, Naive Bayes 80% accuracy, and KStar and AD Tree 100% accuracy.

• In this paper, the performance of data mining techniques on neuro-degenerative data is discussed. The proposed feature selection method gives an accuracy of 93% in prediction of Parkinson's disease.
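Several of the studies above report sensitivity, specificity, accuracy, and positive and negative predictive values. As a minimal sketch, these measures can be derived from the confusion-matrix counts of a binary classifier as shown below. The labels and predictions are made-up values, and the function names are assumptions for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluation_measures(y_true, y_pred, positive=1):
    """Return the standard measures; undefined ratios are reported as NaN."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


# Hypothetical labels: 1 = Parkinson's, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

measures = evaluation_measures(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy is useful when the classes are imbalanced, because accuracy alone can hide poor performance on the smaller class.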

References

[1] P. Suganya and C. P. Sumathi, A Novel Metaheuristic Data Mining Algorithm for the Detection and Classification of Parkinson Disease, Indian Journal of Science and Technology, Vol 8(14), DOI: 10.17485/ijst/2015/v8i14/72685, July 2015.

[2] Arvind Kumar Tiwari, Machine Learning Based Approaches for Prediction of Parkinsons Disease, Machine Learning and Applications: An International Journal (MLAIJ), Vol. 3, No. 2, June 2016.

[3] Dr. Hariganesh S, Gracy Annamary S, A Survey of Parkinsons Disease using Data Mining Algorithms, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4), 2014, 4943-4944.

[4] A Data Mining Methodology for Predicting early stage Parkinsons disease using non-invasive, high dimensional gait sensor data, NSF I/UCRC Center for Healthcare Organization Transformation (CHOT), NSF I/UCRC grant 1067885.

[5] Tarigoppula V. S. Sriram, M. Venkateswara Rao, G. V. Satya Narayana, D. S. V. G. K. Kaladhar, T. Pandu Ranga Vital, Intelligent Parkinson Disease Prediction Using Machine Learning Algorithms, International Journal of Engineering and Innovative Technology (IJEIT), Volume 3, Issue 3, September 2013.

[6] Tarigoppula V. S. Sriram, M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar, ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data, International Journal of Engineering Trends and Technology (IJETT), Volume 31, Number 3, January 2016.

[7] Tejeswinee K., Shomona Gracia Jacob, Athilakshmi, Feature Selection Techniques for Prediction of Neuro-Degenerative Disorders: A Case-Study with Alzheimers And Parkinsons Disease, International Conference on Advances in Electrical, Electronics, Information.
